Re: [ceph-users] New cluster in unhealthy state
Try ceph osd pool set rbd pgp_num 310 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Dave Durkee Sent: 19 June 2015 22:31 To: ceph-users@lists.ceph.com Subject: [ceph-users] New cluster in unhealthy state *snip* ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
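This one-liner addresses the last warning in the status output: pgp_num (64) lags pg_num (310), so placement cannot complete. A sketch of the full check-and-repair sequence, with the pool name and OSD ids taken from Dave's output and sysvinit start syntax assumed (as used elsewhere in this digest):

ceph osd pool set rbd pgp_num 310   # let placement catch up with pg_num
ceph osd tree | grep down           # osd.0 and osd.1 on host osd1 show as down
/etc/init.d/ceph start osd.0        # run on host osd1; repeat for osd.1
ceph health                         # re-check once peering settles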
Re: [ceph-users] rados gateway to use ec pools
Just configure '.rgw.buckets' as an EC pool; the rest of the rgw pools should be replicated. Thanks & Regards Somnath -Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Deneau, Tom Sent: Friday, June 19, 2015 2:31 PM To: ceph-users@lists.ceph.com Subject: [ceph-users] rados gateway to use ec pools What is the correct way to make radosgw create its pools as erasure coded pools? -- Tom Deneau, AMD ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
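Since radosgw creates its pools as replicated by default, the usual approach is to create the data pool yourself before the gateway first writes to it. A minimal sketch — the profile name, k/m values, and PG counts below are illustrative, not from the thread:

ceph osd erasure-code-profile set rgw-ec k=4 m=2 ruleset-failure-domain=host
ceph osd pool create .rgw.buckets 128 128 erasure rgw-ec
# leave .rgw, .rgw.buckets.index, and the other small rgw pools replicated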
Re: [ceph-users] Ceph EC pool performance benchmarking, high latencies.
-Original Message- From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson Sent: 19 June 2015 13:44 To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Ceph EC pool performance benchmarking, high latencies. On 06/19/2015 07:28 AM, MATHIAS, Bryn (Bryn) wrote: *snip*
Re: [ceph-users] Very chatty MON logs: Is this normal?
On 06/19/2015 11:16 AM, Daniel Schneller wrote: On 2015-06-18 09:53:54 +0000, Joao Eduardo Luis said: Setting 'mon debug = 0/5' should be okay. Unless you see that setting '/5' impacts your performance and/or memory consumption, you should leave that be. '0/5' means 'output only debug 0 or lower to the logs; keep the last 1000 debug level 5 or lower in memory in case of a crash'. Your logs will not be as heavily populated but, if for some reason the daemon crashes, you get quite a bit of debug information to help track down the source of the problem. Great, will do. Just for my understanding re: memory: If this is a ring buffer for the last 10000 events, shouldn't that be a somewhat fixed amount of memory? How would it negatively affect the MON's consumption? Assuming it works that way, once they have been running for a few days or weeks, these buffers would be full of events anyway, just more aged ones if the memory level was lower? Daniel From briefly taking a peek at 'src/log/*', this looks like it is a linked list rather than a ring buffer. So, given it will always be capped at 10k events, there's a fixed amount of memory it will consume in the worst case (when you have 10k events). But if you have bare minimum activity in the logs, said memory consumption should be lower, or at most slowly growing as the queue grows. Although it was not obvious, my initial thought was that someone with debug levels set at 0/0 would certainly be surprised if, after setting 0/5, the daemon's memory consumption started to grow. In retrospect, 10k log messages should not take more than a handful of MBs, and should not have any impact at all as long as you're not provisioning your monitor's memory in the dozens of MBs. -Joao ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
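For anyone applying Joao's suggestion, the setting can go into ceph.conf or be injected at runtime; both forms below are standard ceph administration rather than commands quoted from the thread:

# persistent, in ceph.conf on the monitor hosts
[mon]
debug mon = 0/5
# or live, without a restart
ceph tell mon.* injectargs '--debug-mon 0/5'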
[ceph-users] New cluster in unhealthy state
I just built a small lab cluster: 1 mon node, 3 osd nodes with 3 ceph disks and 1 os/journal disk, an admin vm and 3 client vm's. I followed the preflight and install instructions and when I finished adding the osd's I ran a ceph status and got the following: ceph status cluster b4419183-5320-4701-aae2-eb61e186b443 health HEALTH_WARN 32 pgs degraded 64 pgs stale 32 pgs stuck degraded 246 pgs stuck inactive 64 pgs stuck stale 310 pgs stuck unclean 32 pgs stuck undersized 32 pgs undersized pool rbd pg_num 310 pgp_num 64 monmap e1: 1 mons at {mon=172.17.1.16:6789/0} election epoch 2, quorum 0 mon osdmap e49: 11 osds: 9 up, 9 in pgmap v122: 310 pgs, 1 pools, 0 bytes data, 0 objects 298 MB used, 4189 GB / 4189 GB avail 246 creating 32 stale+active+undersized+degraded 32 stale+active+remapped ceph health HEALTH_WARN 32 pgs degraded; 64 pgs stale; 32 pgs stuck degraded; 246 pgs stuck inactive; 64 pgs stuck stale; 310 pgs stuck unclean; 32 pgs stuck undersized; 32 pgs undersized; pool rbd pg_num 310 pgp_num 64 ceph quorum_status {"election_epoch":2,"quorum":[0],"quorum_names":["mon"],"quorum_leader_name":"mon","monmap":{"epoch":1,"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.000000","created":"0.000000","mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}} ceph mon_status {"name":"mon","rank":0,"state":"leader","election_epoch":2,"quorum":[0],"outside_quorum":[],"extra_probe_peers":[],"sync_provider":[],"monmap":{"epoch":1,"fsid":"b4419183-5320-4701-aae2-eb61e186b443","modified":"0.000000","created":"0.000000","mons":[{"rank":0,"name":"mon","addr":"172.17.1.16:6789\/0"}]}} ceph osd tree ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY -1 4.94997 root default -2 2.24998 host osd1 0 0.45000 osd.0 down 0 1.00000 1 0.45000 osd.1 down 0 1.00000 2 0.45000 osd.2 up 1.00000 1.00000 3 0.45000 osd.3 up 1.00000 1.00000 10 0.45000 osd.10 up 1.00000 1.00000 -3 1.34999 host osd2 4 0.45000 osd.4 up 1.00000 1.00000 5 0.45000 osd.5 up 1.00000 1.00000 6 0.45000 osd.6 up 1.00000 1.00000 -4 1.34999 host osd3 7 0.45000 osd.7 up 1.00000 1.00000 8 0.45000 osd.8 up 1.00000 1.00000 9 0.45000 osd.9 up 1.00000 1.00000 Admin-node: [root@admin test-cluster]# cat ceph.conf [global] auth_service_required = cephx filestore_xattr_use_omap = true auth_client_required = cephx auth_cluster_required = cephx mon_host = 172.17.1.16 mon_initial_members = mon fsid = b4419183-5320-4701-aae2-eb61e186b443 osd pool default size = 2 public network = 172.17.1.0/24 cluster network = 10.0.0.0/24 How do I diagnose and solve the cluster health issue? Do you need any additional information to help with the diag process? Thanks!! Dave ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rados gateway to use ec pools
what is the correct way to make radosgw create its pools as erasure coded pools? -- Tom Deneau, AMD ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Build latest KRBD module
Hi, guys! Do we have any procedure on how to build the latest KRBD module? I think it will be helpful to many people here. Regards, Vasily. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] cephfs unmounts itself from time to time
On 19 June 2015 at 13:46, Gregory Farnum g...@gregs42.com wrote: On Thu, Jun 18, 2015 at 10:15 PM, Roland Giesler rol...@giesler.za.net wrote: On 15 June 2015 at 13:09, Gregory Farnum g...@gregs42.com wrote: On Mon, Jun 15, 2015 at 4:03 AM, Roland Giesler rol...@giesler.za.net wrote: I have a small cluster of 4 machines and quite a few drives. After about 2-3 weeks cephfs fails. It's not properly mounted anymore in /mnt/cephfs, which of course causes the VM's running to fail too. snip I'm under the impression that CephFS is the filesystem implemented by ceph-fuse. Is it not? Of course it is, but it's a different implementation than the kernel client and often has different bugs. ;) Plus you can get a newer version of it easily. Let me look into it and see how it might help me. Other than that, can you include more information about exactly what you mean when saying CephFS unmounts itself? Everything runs fine for weeks. Then suddenly a user reports that a VM is not functioning anymore. On investigation it transpires that CephFS is not mounted anymore and the error I reported is logged. I can't see anything else wrong at this stage. ceph is running, the osds are all up. Maybe one of our kernel devs has a better idea but I've no clue how to debug this if you can't give me any information about how CephFS came to be unmounted. It just doesn't make any sense to me. :( I'll go through the logs again and find the point where it happens and post it. - Roland ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] reversing the removal of an osd (re-adding osd)
Hello everybody, I'm doing some experiments and I am trying to re-add a removed osd. I removed it with the five commands below. http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ ceph osd out 5 /etc/init.d/ceph stop osd.5 ceph osd crush remove osd.5 ceph auth del osd.5 ceph osd rm 5 I think I added the auth back correctly, but I can't figure out the right crush add commands? ceph auth add osd.5 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-5/keyring root@ceph03:~# /etc/init.d/ceph start osd.5 === osd.5 === Error ENOENT: osd.5 does not exist. create it before updating the crush map failed: 'timeout 30 /usr/bin/ceph -c /etc/ceph/ceph.conf --name=osd.5 --keyring=/var/lib/ceph/osd/ceph-5/keyring osd crush create-or-move -- 5 0.91 host=ceph03 root=default' Can somebody show me some examples of the right commands to re-add? Kind regards, Jelle de Jong ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] reversing the removal of an osd (re-adding osd)
On 19/06/15 16:07, Jelle de Jong wrote: *snip* I figured it out myself :) root@ceph03:~# ceph osd create 5 root@ceph03:~# ceph osd crush add 5 0.0 host=ceph03 root=default add item id 5 name 'osd.5' weight 0 at location {host=ceph03,root=default} to crush map root@ceph03:~# /etc/init.d/ceph start osd.5 === osd.5 === create-or-move updated item name 'osd.5' weight 0.91 at location {host=ceph03,root=default} to crush map Starting Ceph osd.5 on ceph03... starting osd.5 at :/0 osd_data /var/lib/ceph/osd/ceph-5 /var/lib/ceph/osd/ceph-5/journal Kind regards, Jelle de Jong ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
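Pulling Jelle's two messages together, the whole remove/re-add cycle looks roughly like this — a sketch assuming sysvinit and that /var/lib/ceph/osd/ceph-5 still holds the old OSD's data and keyring:

# removal
ceph osd out 5
/etc/init.d/ceph stop osd.5
ceph osd crush remove osd.5
ceph auth del osd.5
ceph osd rm 5
# re-add: recreate the id, restore the key, put it back into the CRUSH map
ceph osd create                      # hands back the lowest free id (5 here)
ceph auth add osd.5 osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-5/keyring
ceph osd crush add 5 0.0 host=ceph03 root=default
/etc/init.d/ceph start osd.5         # the init script corrects the weight via create-or-move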
Re: [ceph-users] incomplete pg, recovery some data
On Thu, Jun 18, 2015 at 01:24:38PM +0200, Mateusz Skała wrote: Hi, After some hardware errors one of the pgs on our backup server is 'incomplete'. I did export the pg without problems, like here: https://ceph.com/community/incomplete-pgs-oh-my/ After removing the pg from all osd's and importing the pg to one osd, the pg is still 'incomplete'. I want to recover only a small piece of data from this rbd, so if I lose something then nothing bad happens. How can I tell ceph to accept this pg as complete and clean? I have a patch for ceph-objectstore-tool, which adds a mark-complete operation, as it has been suggested by Sam in http://tracker.ceph.com/issues/10098 https://github.com/ceph/ceph/pull/5031 It has not been reviewed yet and not tested well though, because I don't know a simple way to get an incomplete pg. You might want to try it at your own risk. -- Mykola Golub ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
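With that patch applied, the new operation would presumably be invoked like the tool's existing ops (export/import use the same shape). A sketch only — the pgid and paths are illustrative, the OSD must be stopped first, and the exact flag depends on the patch:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-X --journal-path /var/lib/ceph/osd/ceph-X/journal --pgid 6.aa --op mark-complete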
Re: [ceph-users] Fwd: Re: Unexpected disk write activity with btrfs OSDs
On 06/19/15 13:42, Burkhard Linke wrote: Forget the reply to the list... Forwarded Message Subject: Re: [ceph-users] Unexpected disk write activity with btrfs OSDs Date: Fri, 19 Jun 2015 09:06:33 +0200 From: Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de To: Lionel Bouton lionel+c...@bouton.name Hi, On 06/18/2015 11:28 PM, Lionel Bouton wrote: Hi, *snipsnap* - Disks with btrfs OSD have a spike of activity every 30s (2 intervals of 10s with nearly 0 activity, one interval with a total amount of writes of ~120MB). The averages are : 4MB/s, 100 IO/s. Just a guess: btrfs has a commit interval which defaults to 30 seconds. You can verify this by changing the interval with the commit=XYZ mount option. I know and I tested commit intervals of 60 and 120 seconds without any change. As this is directly linked to filestore max sync interval I didn't report this test result. Best regards, Lionel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
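For readers following along: the commit interval Burkhard refers to is a mount option. A sketch with an illustrative device and mount point:

# /etc/fstab entry for a btrfs OSD with a 60s commit interval (the default is 30s)
/dev/sdb1  /var/lib/ceph/osd/ceph-0  btrfs  noatime,commit=60  0  2
# or on a live system:
mount -o remount,commit=60 /var/lib/ceph/osd/ceph-0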
Re: [ceph-users] Explanation for ceph osd set nodown and ceph osd cluster_snap
Hi Jan, On 06/18/2015 12:48 AM, Jan Schermer wrote: 1) Flags available in ceph osd set are pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub|notieragent I know or can guess most of them (the docs are a “bit” lacking) But with “ceph osd set nodown” I have no idea what it should be used for - to keep hammering a faulty OSD? I only know the documentation for this one: http://ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ You can set an OSD to nodown if you know for certain that it is not faulty but it gets set to this state by the monitor because of problems with the cluster network. Cheers, Carsten 2) looking through the docs there I found reference to “ceph osd cluster_snap” http://ceph.com/docs/v0.67.9/rados/operations/control/ what does it do? how does that work? does it really work? ;-) I got a few hits on google which suggest it might not be something that really works, but looks like something we could certainly use Thanks Jan ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
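To make the mechanics concrete, the flag is toggled like any other cluster flag (standard ceph CLI, not quoted from the thread):

ceph osd set nodown    # monitors stop marking flapping OSDs down
# ...fix the cluster-network problem...
ceph osd unset nodown  # return to normal failure detection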
[ceph-users] Ceph EC pool performance benchmarking, high latencies.
Hi All, I am currently benchmarking CEPH to work out the correct read / write model, to get the optimal cluster throughput and latency. For the moment I am writing 4Mb files to an EC 4+1 pool with a randomised name using the rados python interface. Load generation is happening on external machines. Write generation is characterised as the number of IOContexts and the number of simultaneous async writes on those contexts. With one machine, IOContexts threads and 50 simultaneous writes per context I achieve over 300 seconds: Percentile 5 = 0.133775639534 Percentile 10 = 0.178686833382 Percentile 15 = 0.180827605724 Percentile 20 = 0.185487747192 Percentile 25 = 0.229317903519 Percentile 30 = 0.23066740036 Percentile 35 = 0.232764816284 Percentile 40 = 0.278827047348 Percentile 45 = 0.280579996109 Percentile 50 = 0.283169865608 Percentile 55 = 0.329843044281 Percentile 60 = 0.332481050491 Percentile 65 = 0.380337607861 Percentile 70 = 0.428911447525 Percentile 75 = 0.438932359219 Percentile 80 = 0.530071306229 Percentile 85 = 0.597331762314 Percentile 90 = 0.735066819191 Percentile 95 = 1.08006491661 Percentile 100 = 11.7352428436 Max latancies = 11.7352428436, Min = 0.0499050617218, mean = 0.43913059745 Total objects writen = 24552 in time 302.979903936s gives 81.0350775118/s (324.140310047 MB/s) From two load generators on separate machines I achieve: Percentile 5 = 0.228541088104 Percentile 10 = 0.23213224411 Percentile 15 = 0.279508590698 Percentile 20 = 0.28137254715 Percentile 25 = 0.328829288483 Percentile 30 = 0.330499911308 Percentile 35 = 0.334045898914 Percentile 40 = 0.380131435394 Percentile 45 = 0.382810294628 Percentile 50 = 0.430188417435 Percentile 55 = 0.43399245739 Percentile 60 = 0.48120136261 Percentile 65 = 0.530511438847 Percentile 70 = 0.580485081673 Percentile 75 = 0.631661534309 Percentile 80 = 0.728989124298 Percentile 85 = 0.830820584297 Percentile 90 = 1.03238985538 Percentile 95 = 1.62925363779 Percentile 100 = 32.5414278507 Max latancies = 32.5414278507, Min = 0.0375339984894, mean = 0.863403101415 Total objects writen = 12714 in time 325.92741394s gives 39.0086855422/s (156.034742169 MB/s) Percentile 5 = 0.229072237015 Percentile 10 = 0.247376871109 Percentile 15 = 0.280901908875 Percentile 20 = 0.329082489014 Percentile 25 = 0.331234931946 Percentile 30 = 0.379406833649 Percentile 35 = 0.381390666962 Percentile 40 = 0.429595994949 Percentile 45 = 0.43164896965 Percentile 50 = 0.480262041092 Percentile 55 = 0.529169607162 Percentile 60 = 0.533170747757 Percentile 65 = 0.582635164261 Percentile 70 = 0.634325170517 Percentile 75 = 0.72939991951 Percentile 80 = 0.829002094269 Percentile 85 = 0.931713819504 Percentile 90 = 1.18014221191 Percentile 95 = 2.08048944473 Percentile 100 = 31.1357450485 Max latancies = 31.1357450485, Min = 0.0553231239319, mean = 1.03054529335 Total objects writen = 10769 in time 328.515608788s gives 32.7807863978/s (131.123145591 MB/s) Total = 278Mb/s The combined test has much higher latencies and a less than half throughput per box. If I scale this up to 5 nodes all generating load I see the throughput drop to ~50MB/s and latencies up to 60 seconds. 
An example slow write from dump_historic_ops is: description: osd_op(client.1892123.0:1525 \/c18\/vx1907\/kDDb\/180\/4935.ts [] 6.f4d68aae ack+ondisk+write+known_if_redirected e523), initiated_at: 2015-06-19 12:37:54.698848, age: 578.438516, duration: 38.399151, type_data: [ commit sent; apply or cleanup, { client: client.1892123, tid: 1525 }, [ { time: 2015-06-19 12:37:54.698848, event: initiated }, { time: 2015-06-19 12:37:54.856361, event: reached_pg }, { time: 2015-06-19 12:37:55.095731, event: started }, { time: 2015-06-19 12:37:55.103645, event: started }, { time: 2015-06-19 12:37:55.104125, event: commit_queued_for_journal_write }, { time: 2015-06-19 12:37:55.104900, event: write_thread_in_journal_buffer }, { time: 2015-06-19 12:37:55.106112, event: journaled_completion_queued }, { time: 2015-06-19 12:37:55.107065, event: sub_op_committed }, { time: 2015-06-19
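For anyone wanting to reproduce this kind of load, here is a rough sketch of the pattern Bryn describes — async writes through the rados Python bindings with a bounded number in flight. The pool name, object count, and in-flight limit are illustrative, not taken from the thread:

import threading, time, uuid
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('ecpool')           # the EC 4+1 pool

data = b'\x00' * (4 * 1024 * 1024)             # 4 MB payload
inflight = threading.Semaphore(50)             # 50 simultaneous writes per context
latencies = []                                 # completion latencies, for percentiles

def on_complete(start):
    def cb(completion):
        latencies.append(time.time() - start)  # called from a librados thread
        inflight.release()
    return cb

for _ in range(10000):
    inflight.acquire()                         # block while 50 writes are outstanding
    name = uuid.uuid4().hex                    # randomised object name
    ioctx.aio_write_full(name, data, oncomplete=on_complete(time.time()))

for _ in range(50):                            # drain the remaining completions
    inflight.acquire()
ioctx.close()
cluster.shutdown()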
Re: [ceph-users] cephfs unmounts itself from time to time
On Thu, Jun 18, 2015 at 10:15 PM, Roland Giesler rol...@giesler.za.net wrote: On 15 June 2015 at 13:09, Gregory Farnum g...@gregs42.com wrote: On Mon, Jun 15, 2015 at 4:03 AM, Roland Giesler rol...@giesler.za.net wrote: I have a small cluster of 4 machines and quite a few drives. After about 2-3 weeks cephfs fails. It's not properly mounted anymore in /mnt/cephfs, which of course causes the VM's running to fail too. In /var/log/syslog I have /mnt/cephfs: File exists at /usr/share/perl5/PVE/Storage/DirPlugin.pm line 52 repeatedly. There doesn't seem to be anything wrong with ceph at the time. # ceph -s cluster 40f26838-4760-4b10-a65c-b9c1cd671f2f health HEALTH_WARN clock skew detected on mon.s1 monmap e2: 2 mons at {h1=192.168.121.30:6789/0,s1=192.168.121.33:6789/0}, election epoch 312, quorum 0,1 h1,s1 mdsmap e401: 1/1/1 up {0=s3=up:active}, 1 up:standby osdmap e5577: 19 osds: 19 up, 19 in pgmap v11191838: 384 pgs, 3 pools, 774 GB data, 455 kobjects 1636 GB used, 9713 GB / 11358 GB avail 384 active+clean client io 12240 kB/s rd, 1524 B/s wr, 24 op/s # ceph osd tree # id weight type name up/down reweight -1 11.13 root default -2 8.14 host h1 1 0.9 osd.1 up 1 3 0.9 osd.3 up 1 4 0.9 osd.4 up 1 5 0.68 osd.5 up 1 6 0.68 osd.6 up 1 7 0.68 osd.7 up 1 8 0.68 osd.8 up 1 9 0.68 osd.9 up 1 10 0.68 osd.10 up 1 11 0.68 osd.11 up 1 12 0.68 osd.12 up 1 -3 0.45 host s3 2 0.45 osd.2 up 1 -4 0.9 host s2 13 0.9 osd.13 up 1 -5 1.64 host s1 14 0.29 osd.14 up 1 0 0.27 osd.0 up 1 15 0.27 osd.15 up 1 16 0.27 osd.16 up 1 17 0.27 osd.17 up 1 18 0.27 osd.18 up 1 When I umount -l /mnt/cephfs and then mount -a after that, the ceph volume is loaded again. I can restart the VM's and all seems well. I can't find errors pertaining to cephfs in the other logs either. System information: Linux s1 2.6.32-34-pve #1 SMP Fri Dec 19 07:42:04 CET 2014 x86_64 GNU/Linux I'm not sure what version of Linux this really is (I assume it's a vendor kernel of some kind!), but it's definitely an old one! CephFS sees pretty continuous improvements to stability and it could be any number of resolved bugs. This is the stock standard installation of Proxmox with CephFS. If you can't upgrade the kernel, you might try out the ceph-fuse client instead, as you can run a much newer and more up-to-date version of it, even on the old kernel. I'm under the impression that CephFS is the filesystem implemented by ceph-fuse. Is it not? Of course it is, but it's a different implementation than the kernel client and often has different bugs. ;) Plus you can get a newer version of it easily. Other than that, can you include more information about exactly what you mean when saying CephFS unmounts itself? Everything runs fine for weeks. Then suddenly a user reports that a VM is not functioning anymore. On investigation it transpires that CephFS is not mounted anymore and the error I reported is logged. I can't see anything else wrong at this stage. ceph is running, the osds are all up. Maybe one of our kernel devs has a better idea but I've no clue how to debug this if you can't give me any information about how CephFS came to be unmounted. It just doesn't make any sense to me. :( -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Very chatty MON logs: Is this normal?
On 2015-06-18 09:53:54 +0000, Joao Eduardo Luis said: Setting 'mon debug = 0/5' should be okay. Unless you see that setting '/5' impacts your performance and/or memory consumption, you should leave that be. '0/5' means 'output only debug 0 or lower to the logs; keep the last 1000 debug level 5 or lower in memory in case of a crash'. Your logs will not be as heavily populated but, if for some reason the daemon crashes, you get quite a bit of debug information to help track down the source of the problem. Great, will do. Just for my understanding re: memory: If this is a ring buffer for the last 10000 events, shouldn't that be a somewhat fixed amount of memory? How would it negatively affect the MON's consumption? Assuming it works that way, once they have been running for a few days or weeks, these buffers would be full of events anyway, just more aged ones if the memory level was lower? Daniel ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Fwd: Re: Unexpected disk write activity with btrfs OSDs
Forget the reply to the list... Forwarded Message Subject: Re: [ceph-users] Unexpected disk write activity with btrfs OSDs Date: Fri, 19 Jun 2015 09:06:33 +0200 From: Burkhard Linke burkhard.li...@computational.bio.uni-giessen.de To: Lionel Bouton lionel+c...@bouton.name Hi, On 06/18/2015 11:28 PM, Lionel Bouton wrote: Hi, *snipsnap* - Disks with btrfs OSD have a spike of activity every 30s (2 intervals of 10s with nearly 0 activity, one interval with a total amount of writes of ~120MB). The averages are : 4MB/s, 100 IO/s. Just a guess: btrfs has a commit interval which defaults to 30 seconds. You can verify this by changing the interval with the commit=XYZ mount option. Best regards, Burkhard ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] qemu jemalloc patch
Hi, I have send a patch to qemu devel mailing list to add support jemalloc linking http://lists.nongnu.org/archive/html/qemu-devel/2015-06/msg05265.html Help is welcome to get it upstream ! ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RadosGW Performance
I'm trying to evaluate various object stores/distributed file systems for use in our company and have a little experience of using Ceph in the past. However I'm running into a few issues when running some benchmarks against RadosGW. Basically my script is pretty dumb, but it captures one of our primary use cases reasonably accurately - it iteratively copies files repeatedly onto a different key either in s3, or to a hierarchical directory structure on a block device (eg 000/000/000/001/1.jpg) where the directory is a key. When adding to an s3-esque object store, it uses the same scheme to generate the key for the file. Now when running this script against an RBD volume I get high hundreds of MB/s throughput quite happily particularly if I run the process in parallel (forking the process multiple times). However if I try to bludgeon the script to use the s3 interface via radosgw, everything grinds to a halt (read 0.5MB/s throughput per fork). This is a problem. I don't believe that the discrepancy is due to anything other than a misconfiguration. The test cluster is running with 3 nodes, 86 drives/OSDs each (they are currently 6tb). Our use case requires the storage density to be high. HW wise, there is 256GB Ram with 2 12Core E5-2690 v3 @ 2.60GHz, so more than enough CPU/Ram capacity. Currently I have RadosGW running on one of the nodes with Apache 2.4.7 acting as the proxy. Any suggestions/pointers would be more than welcome, as ceph is high on our list of favourites due to its feature set. It definitely should be performing faster than this. Regards Stuart Harland ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd performance issue - can't find bottleneck
Hi guys, I also use a combination of intel 520 and 530 for my journals and have noticed that the latency and the speed of the 520s is better than the 530s. Could someone please confirm that doing the following at start up will stop the dsync on the relevant drives? # echo "temporary write through" > /sys/class/scsi_disk/1\:0\:0\:0/cache_type Do I need to patch my kernel for this or is this already implementable in vanilla? I am running the 3.19.x branch from the ubuntu testing repo. Would the above change the performance of the 530s to be more like the 520s? Cheers Andrei - Original Message - From: Alexandre DERUMIER aderum...@odiso.com To: Jacek Jarosiewicz jjarosiew...@supermedia.pl Cc: ceph-users ceph-users@lists.ceph.com Sent: Thursday, 18 June, 2015 11:54:42 AM Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck Hi, for the read benchmark with fio, what is the iodepth? my fio 4k randread results: iodepth=1 : bw=6795.1KB/s, iops=1698 iodepth=2 : bw=14608KB/s, iops=3652 iodepth=4 : bw=32686KB/s, iops=8171 iodepth=8 : bw=76175KB/s, iops=19043 iodepth=16 : bw=173651KB/s, iops=43412 iodepth=32 : bw=336719KB/s, iops=84179 (This should be similar with the rados bench -t (threads) option.) This is normal because of network latencies + ceph latencies. Doing more parallelism increases iops. (Doing a bench with dd = iodepth=1.) These results are with 1 client/rbd volume. Now with more fio clients (numjobs=X) I can reach up to 300k iops with 8-10 clients. This should be the same as launching multiple rados bench in parallel (BTW, it would be great to have an option in rados bench to do it) - Original Message - From: Jacek Jarosiewicz jjarosiew...@supermedia.pl To: Mark Nelson mnel...@redhat.com, ceph-users ceph-users@lists.ceph.com Sent: Thursday, 18 June 2015 11:49:11 Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck On 06/17/2015 04:19 PM, Mark Nelson wrote: SSD's are INTEL SSDSC2BW240A4 Ah, if I'm not mistaken that's the Intel 530, right? You'll want to see this thread by Stefan Priebe: https://www.mail-archive.com/ceph-users@lists.ceph.com/msg05667.html In fact it was the difference in Intel 520 and Intel 530 performance that triggered many of the different investigations that have taken place by various folks into SSD flushing behavior on ATA_CMD_FLUSH. The gist of it is that the 520 is very fast but probably not safe. The 530 is safe but not fast. The DC S3700 (and similar drives with supercapacitors) are thought to be both fast and safe (though some drives like the Crucial M500 and later misrepresented their power loss protection, so you have to be very careful!) Yes, these are Intel 530. I did the tests described in the thread you pasted and unfortunately that's my case... I think.
The dd run locally on a mounted ssd partition looks like this: [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 oflag=direct,dsync 10000+0 records in 10000+0 records out 3584000000 bytes (3.6 GB) copied, 211.698 s, 16.9 MB/s and when I skip the flag dsync it goes fast: [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 oflag=direct 10000+0 records in 10000+0 records out 3584000000 bytes (3.6 GB) copied, 9.05432 s, 396 MB/s (I used the same 350k block size as mentioned in the e-mail from the thread above) I tried disabling the dsync like this: [root@cf02 ~]# echo "temporary write through" > /sys/class/scsi_disk/1\:0\:0\:0/cache_type [root@cf02 ~]# cat /sys/class/scsi_disk/1\:0\:0\:0/cache_type write through ...and then locally I see the speedup: [root@cf02 journal]# dd if=/dev/zero of=test bs=350k count=10000 oflag=direct,dsync 10000+0 records in 10000+0 records out 3584000000 bytes (3.6 GB) copied, 10.4624 s, 343 MB/s ...but when I test it from a client I still get slow results: root@cf03:/ceph/tmp# dd if=/dev/zero of=test bs=100M count=100 oflag=direct 100+0 records in 100+0 records out 10485760000 bytes (10 GB) copied, 122.482 s, 85.6 MB/s and fio gives the same 2-3k iops. After the change to the SSD cache_type I tried remounting the test image, recreating it and so on - nothing helped. I ran rbd bench-write on it, and it's not good either: root@cf03:~# rbd bench-write t2 bench-write io_size 4096 io_threads 16 bytes 1073741824 pattern seq SEC OPS OPS/SEC BYTES/SEC 1 4221 4220.64 32195919.35 2 9628 4813.95 36286083.00 3 15288 4790.90 35714620.49 4 19610 4902.47 36626193.93 5 24844 4968.37 37296562.14 6 30488 5081.31 38112444.88 7 36152 5164.54 38601615.10 8 41479 5184.80 38860207.38 9 46971 5218.70 39181437.52 10 52219 5221.77 39322641.34 11 5 5151.36 38761566.30 12 62073 5172.71 38855021.35 13 65962 5073.95 38182880.49 14 71541 5110.02 38431536.17 15 77039 5135.85 38615125.42 16 82133 5133.31 38692578.98 17 87657 5156.24 38849948.84 18 92943 5141.03 38635464.85 19 97528 5133.03
[ceph-users] EC on 1.1PB?
I am looking to use Ceph using EC on a few leftover storage servers (36 disk supermicro servers with dual xeon sockets and around 256Gb of ram). I did a small test using one node and using the ISA library and noticed that the CPU load was pretty spikey for just normal operation. Does anyone have any experience running Ceph EC on around 216 to 270 4TB disks? I'm looking to yield around 680 TB to 1PB if possible. Just putting my feelers out there to see if anyone else has had any experience and looking for any guidance. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Block Size
Hi, I have been formatting my OSD drives with XFS (using mkfs.xfs )with default options. Is it recommended for Ceph to choose a bigger block size? I'd like to understand the impact of block size. Any recommendations? Thanks Pankaj ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] EC on 1.1PB?
Hi Sean, We have ~1PB of EC storage using Dell R730xd servers with 6TB OSDs. We've got our erasure coding profile set up to be k=10,m=3 which gives us a very reasonable chunk of the raw storage with nice resiliency. I found that CPU usage was significantly higher in EC, but not so much as to be problematic. Additionally, EC performance was about 40% of replicated pool performance in our testing. With 36-disk servers you'll probably need to make sure you do the usual kernel tweaks like increasing the max number of file descriptors, etc. Cheers, Lincoln On Jun 19, 2015, at 10:36 AM, Sean wrote: I am looking to use Ceph using EC on a few leftover storage servers (36 disk supermicro servers with dual xeon sockets and around 256Gb of ram). I did a small test using one node and using the ISA library and noticed that the CPU load was pretty spikey for just normal operation. Does anyone have any experience running Ceph EC on around 216 to 270 4TB disks? I'm looking to yield around 680 TB to 1PB if possible. Just putting my feelers out there to see if anyone else has had any experience and looking for any guidance. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
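For the kernel tweaks Lincoln mentions, the usual knobs look like this (values illustrative, not from his deployment):

# /etc/security/limits.conf — raise the per-process open-file cap
*    soft    nofile    131072
*    hard    nofile    131072
# and in ceph.conf, so the OSDs ask for more descriptors:
[osd]
max open files = 131072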
Re: [ceph-users] rbd performance issue - can't find bottleneck
On 06/19/2015 11:19 AM, Andrei Mikhailovsky wrote: Mark, thanks for putting it down this way. It does make sense. Does it mean that having the Intel 520s, which bypass the dsync, is a threat to the data stored on the journals? I'm not sure if anyone has ever 100% conclusively shown that this is what they are doing, but given their performance that's the current theory. I still use them in our test lab because we've got a bunch of them and they are reasonably close in terms of performance to the DC S3700, but I'd be very concerned using them in a production environment for real data. I do have a few of these installed, alongside the 530s. I did not plan to replace them just yet. Would it make more sense to get a small battery-protected raid card in front of the 520s and 530s to protect against these types of scenarios? Maybe, but only if you can disable all of the on-disk cache. Since the drive itself is (probably) doing bad things, you are kind of at its mercy and who knows what other demons lurk. I'd be wary. Cheers - Original Message - From: Mark Nelson mnel...@redhat.com To: Andrei Mikhailovsky and...@arhont.com Cc: ceph-users@lists.ceph.com Sent: Friday, 19 June, 2015 5:08:31 PM Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck *snip*
Re: [ceph-users] rbd performance issue - can't find bottleneck
On 06/19/2015 10:29 AM, Andrei Mikhailovsky wrote: Mark, Thanks, I do understand that there is a risk of data loss by doing this. Having said this, ceph is designed to be fault tolerant and self-repairing should something happen to individual journals, osds and server nodes. Isn't this still a good measure to compromise between data integrity and speed? So, by faking dsync and not actually doing this, you have a window of opportunity for data loss should a failure happen between the last flush and the moment of failure. Thus, if the ssd disk failure happens, regardless of whether dsync is used or not, would ceph still consider the osds behind the journal to be unavailable/lost and migrate the data around anyway and perform the necessary checks to make sure the data integrity is not compromised? If this is true, I would still consider using the dsync bypass in favour of the extra speed benefit. Unless I am missing a bigger picture and miscalculated something. Could someone please elaborate on this a bit further to understand the real world threat of not using the dsync bypass? Hi Andrei, Basically the entire point of the Ceph journal is to guarantee that data hits a persistent medium before the write gets acknowledged. Imagine a scenario where you lose power just as the write happens. Scenario A: You have proper O_DSYNC writes. In this case, assuming the SSD is behaving properly, you can be fairly confident that the write to the local journal succeeded (or not). Scenario B: You bypass O_DSYNC. The journal write completes quickly, but it's not actually written out to flash, just to the drive cache. If the SSD has power loss protection it can theoretically write that data out to the flash before it loses power. For this reason, drives with PLP can often perform O_DSYNC writes very quickly even without this hack (ie it can ignore ATA_CMD_FLUSH). For a drive like the 530 without PLP, there's no guarantee that the data in cache will hit the flash. Ceph will *think* it did though, and the risk is worse because the write completes so fast. Now you have a scenario where ceph thinks something exists but it really doesn't (or exists in a corrupted state). This leads to all sorts of problems. If another OSD goes down and you have two copies of the data that disagree with each other, what do you do? What if not all of the replica writes succeeded but you have a copy of the data on the primary? Can you trust it? Everything starts breaking down. Mark Cheers Andrei - Original Message - From: Mark Nelson mnel...@redhat.com To: ceph-users@lists.ceph.com Sent: Friday, 19 June, 2015 3:59:55 PM Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck On 06/19/2015 09:54 AM, Andrei Mikhailovsky wrote: *snip* Would the above change the performance of 530s to be more like 520s? I need to comment that it's *really* not a good idea to do this if you care about data integrity. There's a reason why the 530 is slower than the 520. If you need speed and you care about your data, you should really consider jumping up to the DC S3700. There's a possibility that the 730 *may* be ok as it supposedly has power loss protection, but it's still not using HET MLC so the flash cells will wear out faster. It's also a consumer grade drive, so no one will give you support for this kind of use case if you have problems. Mark *snip*
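A common way to check which camp a given SSD falls into is to benchmark small O_DSYNC writes directly, e.g. with fio. The device path is illustrative, and this writes to the raw device, so only point it at a scratch disk:

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test

Drives with working power-loss protection tend to sustain thousands of these IOPS; drives that actually honour ATA_CMD_FLUSH without PLP, like the 530 discussed above, typically manage far fewer.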
[ceph-users] fail OSD prepare
I am following the quick doc. It was successful until Adding the initial monitor. So I made the osd folders (/var/local/osd0, osd10, osd20) on each node (csAnt, csBull, csCat), and ran ceph-deploy to prepare the OSDs. But the error below occurred. --- jae@csElsa:~$ ceph-deploy osd prepare csAnt:/var/local/osd0 csBull:/var/local/osd10 csCat:/var/local/osd20 [ceph_deploy.conf][DEBUG ] found configuration file at: /home/jae/.cephdeploy.conf [ceph_deploy.cli][INFO ] Invoked (1.5.25): /usr/bin/ceph-deploy osd prepare csAnt:/var/local/osd0 csBull:/var/local/osd10 csCat:/var/local/osd20 [ceph_deploy][ERROR ] ConfigError: Cannot load config: [Errno 2] No such file or directory: 'ceph.conf'; has `ceph-deploy new` been run in this directory? --- Should I do something that is not in the quick doc before preparing the OSDs? -- Jaemyoun Lee CPS Lab. (Cyber-Physical Systems Laboratory in Hanyang University) E-mail : jm...@cpslab.hanyang.ac.kr Homepage : http://cpslab.hanyang.ac.kr ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
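The error message itself points at the likely cause: ceph-deploy looks for the cluster's ceph.conf in the current working directory. A sketch of the usual fix — the directory name is illustrative, it is wherever `ceph-deploy new <mon>` was originally run:

cd ~/my-cluster    # the directory containing ceph.conf and the bootstrap keyrings
ceph-deploy osd prepare csAnt:/var/local/osd0 csBull:/var/local/osd10 csCat:/var/local/osd20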
Re: [ceph-users] rbd performance issue - can't find bottleneck
All - I have been following this thread for a bit, and am happy to see how involved, capable, and collaborative this ceph-users community seems to be. It appears there is a fairly strong amount of domain knowledge around the hardware used by many Ceph deployments, with a lot of thumbs up and thumbs down sort of experience based on bugs, problems, issues, configuration landmines to avoid, etc... Is there somewhere that community experience with hardware like this is being tracked? Not necessarily a full blown HWCL (hardware compatibility list), but maybe a more cohesive list of controllers, SSD/spinning disks, and the community lessons learned (like when to or not to use TRIM, silent corruption, etc...)??? It seems like this is all extremely valuable information as new operators like myself come into the picture... Yes, one can mine the email archives ... Thanks! ~~shane On 6/19/15, 9:08 AM, ceph-users on behalf of Mark Nelson ceph-users-boun...@lists.ceph.com on behalf of mnel...@redhat.com wrote: Would the above change the performance of 530s to be more like 520s? I need to comment that it's *really* not a good idea to do this if you care about data integrity. There's a reason why the 530 is slower than the 520. If you need speed and you care about your data, you should really consider jumping up to the DC S3700. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd performance issue - can't find bottleneck
Mark, thanks for putting it down this way. It does make sense. Does it mean that having the Intel 520s, which bypass the dsync, is a threat to the data stored on the journals? I do have a few of these installed, alongside the 530s. I did not plan to replace them just yet. Would it make more sense to get a small battery-protected raid card in front of the 520s and 530s to protect against these types of scenarios? Cheers - Original Message - From: Mark Nelson mnel...@redhat.com To: Andrei Mikhailovsky and...@arhont.com Cc: ceph-users@lists.ceph.com Sent: Friday, 19 June, 2015 5:08:31 PM Subject: Re: [ceph-users] rbd performance issue - can't find bottleneck *snip*
Re: [ceph-users] EC on 1.1PB?
Thanks Lincoln! May I ask how many drives you have per storage node and how many threads you have available? I.e., are you using hyper-threading, and do you have more than 24 disks per node in your cluster? I noticed with our replicated cluster that more disks == more PGs == more CPU/RAM, and with 24+ disks this ends up causing issues in some cases. So a 3-node cluster with 70 disks each is fine, but scaling up to 21 and I see issues, even with connections, pids, and file descriptors turned up. Are you using just jerasure, or have you tried the ISA driver as well? Sorry for bombarding you with questions; I am just curious as to where the 40% performance figure comes from.

On 06/19/2015 11:05 AM, Lincoln Bryant wrote:

Hi Sean,

We have ~1PB of EC storage using Dell R730xd servers with 6TB OSDs. We've got our erasure coding profile set up to be k=10,m=3, which gives us a very reasonable chunk of the raw storage with nice resiliency. I found that CPU usage was significantly higher in EC, but not so much as to be problematic. Additionally, EC performance was about 40% of replicated pool performance in our testing. With 36-disk servers you'll probably need to make sure you do the usual kernel tweaks, like increasing the max number of file descriptors, etc.

Cheers,
Lincoln

On Jun 19, 2015, at 10:36 AM, Sean wrote:

I am looking to use Ceph with EC on a few leftover storage servers (36-disk Supermicro servers with dual Xeon sockets and around 256GB of RAM). I did a small test using one node and the ISA library and noticed that the CPU load was pretty spikey for just normal operation. Does anyone have any experience running Ceph EC on around 216 to 270 4TB disks? I'm looking to yield around 680 TB to 1PB if possible. Just putting my feelers out there to see if anyone else has had any experience, and looking for any guidance.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
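For reference, the "turned up" settings mentioned above usually amount to a handful of sysctl and ulimit changes on dense OSD nodes. A rough sketch, with illustrative values rather than tuned recommendations:

# /etc/sysctl.d/90-ceph.conf -- illustrative values for a dense OSD node
kernel.pid_max = 4194303
kernel.threads-max = 2097152
fs.file-max = 6553600

# /etc/security/limits.d/ceph.conf -- raise the per-process fd limit
* soft nofile 65536
* hard nofile 65536

Apply with sysctl --system (or reboot) and restart the OSD daemons so they pick up the new limits.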
Re: [ceph-users] Block Size
Pankaj,

I think Linux will not allow a bigger block size than the page size. If you want a block size bigger than 4K, you would need to rebuild the kernel, I guess. I am not sure whether there is any internal setting (or grub parameter) to tweak the page size at boot. I think it is recommended (or best practice) to have a bigger inode size, and also the inode64 mount option.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, Pankaj
Sent: Friday, June 19, 2015 9:59 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Block Size

Hi,

I have been formatting my OSD drives with XFS (using mkfs.xfs) with default options. Is it recommended for Ceph to choose a bigger block size? I'd like to understand the impact of block size. Any recommendations?

Thanks
Pankaj

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
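To make that concrete, a sketch of what a bigger-inode XFS format and an inode64 mount might look like; the device and mount point below are placeholders. Larger (e.g. 2048-byte) inodes leave more room for Ceph's xattrs to be stored inline, while the data block size stays at the 4K page size:

# mkfs.xfs -f -i size=2048 /dev/sdb1
# mount -t xfs -o noatime,inode64 /dev/sdb1 /var/lib/ceph/osd/ceph-0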
Re: [ceph-users] EC on 1.1PB?
We're running 12 OSDs per node, with 32 hyper-threaded CPUs available. We over-provisioned the CPUs because we would like to additionally run jobs from our batch system and isolate them via cgroups (we're a high-throughput computing facility). With a total of ~13000 PGs across a few pools, I'm seeing about 1GB of resident memory per OSD.

As far as EC plugins go, we're using jerasure and haven't experimented with others. That said, in our use case we're using CephFS, so we're fronting the erasure-coded pool with a cache tier. The cache pool is limited to 5TB, and right now usage is light enough that most operations live in the cache tier and rarely get flushed out to the EC pool. I'm sure as we bring more users onto this, there will be some more tweaking to do.

As far as performance goes, you might want to read Mark Nelson's excellent document about EC performance under Firefly. If you search the list archives, he sent a mail in February titled "Erasure Coding CPU Overhead Data". I can forward you the PDF off-list if you would like.

--Lincoln

On Jun 19, 2015, at 12:42 PM, Sean wrote:
[quoted text snipped; see Sean's message above]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
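For anyone wanting to reproduce a setup along these lines, a sketch of the commands involved; the profile and pool names are made up, the PG counts are illustrative, and the 5TB cap matches the limit described above:

# ceph osd erasure-code-profile set ec-k10-m3 k=10 m=3 plugin=jerasure
# ceph osd pool create ecpool 4096 4096 erasure ec-k10-m3
# ceph osd pool create cachepool 1024 1024
# ceph osd tier add ecpool cachepool
# ceph osd tier cache-mode cachepool writeback
# ceph osd tier set-overlay ecpool cachepool
# ceph osd pool set cachepool hit_set_type bloom
# ceph osd pool set cachepool target_max_bytes 5497558138880   # ~5 TB

Clients then address the EC pool via the overlay: writes land in the replicated cache tier and are flushed out to the erasure-coded pool in the background.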
[ceph-users] Unexpected period of iowait, no obvious activity?
Hi!

Recently, over a few hours, our 4 Ceph disk nodes showed unusually high and fairly constant iowait times. The cluster runs 0.94.1 on Ubuntu 14.04.1. It started on one node, then - with maybe 15 minutes' delay each - on the next and the next one. The overall duration of the phenomenon was about 90 minutes on each machine, finishing in the same order they had started.

We could not see any obvious cluster activity during that time, and applications did not do anything out of the ordinary. Scrubbing and deep scrubbing were turned off long before this happened. We are using CephFS for shared administrator home directories on the system, RBD volumes for OpenStack, and the RADOS Gateway to manage application data via the Swift interface. Telemetry and logs from inside the VMs did not offer an explanation either.

The fact that these readings were limited to OSD hosts, and none of the other (client) nodes in the system, suggests this must be some kind of Ceph behaviour. Any ideas? We would like to understand what the system was doing, but haven't found anything obvious in the logs.

Thanks!
Daniel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
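Should the episode recur, a few quick probes can narrow down whether the I/O is coming from the OSD daemons themselves; a sketch, where osd.0 stands in for whichever OSD sits on a busy disk:

# iostat -x 1                     # which block devices are busy, and how
# iotop -obn 5                    # which processes are actually issuing I/O
# ceph daemon osd.0 perf dump     # per-OSD internal counters (run on the OSD host)

Comparing a perf dump taken before and after such an episode (journal, filestore and op counters in particular) usually shows whether the OSDs were doing internal housekeeping or servicing client ops.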