Re: [ceph-users] New Ceph cluster - cannot add additional monitor
Just to follow up: I started from scratch, and I think the key was to run ceph-deploy purge (nodes), ceph-deploy purgedata (nodes), and finally ceph-deploy forgetkeys. Thanks for the replies, Alex and Alex!

Mike C
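For anyone following along, the full reset sequence would look roughly like the sketch below (hostnames are placeholders; note that purge removes the Ceph packages themselves as well as the data, so only do this on a cluster you can rebuild from scratch):

    ceph-deploy purge node1 node2 node3      # remove Ceph packages and data
    ceph-deploy purgedata node1 node2 node3  # remove /var/lib/ceph and /etc/ceph leftovers
    ceph-deploy forgetkeys                   # discard the locally cached auth keys
    ceph-deploy new node1 node2 node3        # then start over with a fresh cluster definition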
[ceph-users] Interesting postmortem on SSDs from Algolia
There's often a great deal of discussion about which SSDs to use for journals, and why some of the cheaper SSDs end up being more expensive in the long run. The recent blog post from Algolia, though not Ceph specific, provides a good illustration of exactly how insidious kernel/SSD interactions can be. Thought the list might find it interesting.

https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/

-Steve

--
Steve Anthony
LTS HPC Support Specialist
Lehigh University
sma...@lehigh.edu
Re: [ceph-users] Very chatty MON logs: Is this normal?
> However, I'd rather not set the level to 0/0, as that would disable all logging from the MONs.

I don't think so. All the error scenarios and stack traces (in case of a crash) are supposed to be logged at log level 0. But generally we need the highest log level (say 20) to get all the information when something needs debugging, so I doubt how beneficial it would be to enable logging at some intermediate level. There is probably also no strict guideline on these log levels that developers must follow.

Thanks & Regards
Somnath

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Daniel Schneller
Sent: Wednesday, June 17, 2015 12:11 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Very chatty MON logs: Is this normal?

On 2015-06-17 18:52:51 +0000, Somnath Roy said:

> This is presently written from log level 1 onwards :-) So, only log level 0 will not log this. Try 'debug_mon = 0/0' in the conf file.

Yeah, once I had sent the mail I realized that the "1" in the log line was the level. I had overlooked that before. However, I'd rather not set the level to 0/0, as that would disable all logging from the MONs.

> Now, I don't have enough knowledge on that part to say whether it is important enough to log at log level 1, sorry :-(

That would indeed be interesting to know. Judging from the sheer amount, I have my doubts at least, because the cluster seems to be running without any issues, so I figure it isn't indicative of an immediate problem. Anyone with a little more definitive knowledge around? Should I create a bug ticket for this?

Cheers,
Daniel
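For reference, a sketch of how the monitor debug level can be adjusted both persistently and at runtime (the values here are illustrative; the first number is the log-file level, the second the in-memory level kept for crash dumps):

    # persistent, in ceph.conf under [mon]:
    debug mon = 1/5

    # or injected at runtime, without a restart:
    ceph tell mon.* injectargs '--debug-mon 0/5'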
Re: [ceph-users] Ceph OSD with OCFS2
Sorry Prabu, I forgot to mention: the conf-file settings below need to be tweaked based on your HW configuration (cpu, disk etc.) and number of OSDs, otherwise they may hit you back badly.

Thanks & Regards
Somnath

From: Somnath Roy
Sent: Wednesday, June 17, 2015 11:25 AM
To: 'gjprabu'
Cc: Kamala Subramani; ceph-users@lists.ceph.com; Siva Sokkumuthu
Subject: RE: RE: Re: [ceph-users] Ceph OSD with OCFS2

Okay. You didn't mention anything about your rbd client host config or the cpu cores of the OSD/rbd systems. Some thoughts on what you can do:

1. Considering the pretty lean cpu config you have, I would say check cpu usage on both the OSD and rbd nodes first. If it is already saturated, you are out of luck ☺

2. Quite a bit of write path improvement went in with Hammer and the latest Ceph; hope you are using that code base.

3. I would say put the ceph journal on SSD at least; this should give you a boost.

4. Check the pool pg number; hope this is at least 64 or so.

5. If you are using kernel rbd to map, take the latest krbd code base and build it for your kernel. The reason is that some very important krbd performance fixes went in that unfortunately are probably not yet part of any kernel. That should give you a boost.

6. Make the following changes in your conf file if you are not doing so already and see if anything improves. Make sure you are at least using 'hammer' for this:

    auth_supported = none
    auth_service_required = none
    auth_client_required = none
    auth_cluster_required = none
    debug_lockdep = 0/0
    debug_context = 0/0
    debug_crush = 0/0
    debug_buffer = 0/0
    debug_timer = 0/0
    debug_filer = 0/0
    debug_objecter = 0/0
    debug_rados = 0/0
    debug_rbd = 0/0
    debug_journaler = 0/0
    debug_objectcatcher = 0/0
    debug_client = 0/0
    debug_osd = 0/0
    debug_optracker = 0/0
    debug_objclass = 0/0
    debug_filestore = 0/0
    debug_keyvaluestore = 0/0
    debug_newstore = 0/0
    debug_journal = 0/0
    debug_ms = 0/0
    debug_monc = 0/0
    debug_tp = 0/0
    debug_auth = 0/0
    debug_finisher = 0/0
    debug_heartbeatmap = 0/0
    debug_perfcounter = 0/0
    debug_asok = 0/0
    debug_throttle = 0/0
    debug_mon = 0/0
    debug_paxos = 0/0
    debug_rgw = 0/0
    osd_op_threads = 2
    ms_crc_data = false
    ms_crc_header = false
    osd_op_num_threads_per_shard = 1
    osd_op_num_shards = 12
    osd_enable_op_tracker = false

7. How many copies are you keeping, 1 or 2?

Thanks & Regards
Somnath

From: gjprabu [mailto:gjpr...@zohocorp.com]
Sent: Wednesday, June 17, 2015 12:05 AM
To: Somnath Roy
Cc: Kamala Subramani; ceph-users@lists.ceph.com; Siva Sokkumuthu
Subject: Re: RE: Re: [ceph-users] Ceph OSD with OCFS2

Hi Somnath,

Yes, we will analyze whether there is any bottleneck - is there any useful command to analyze this?

> 1. What is your backend cluster configuration, like how many OSDs, PGs/pool, HW details etc.?

We are using 2 OSDs, and no PGs/pools were created beyond the defaults. Hardware is a physical machine with above 2 GB RAM.

> 2. Is it a single big rbd image you mounted from different hosts and running OCFS2 on top? Please give some details on that front.

Yes, it is a single rbd image we are using on different hosts, running OCFS2 on top:

    $ rbd ls
    newinteg

    $ rbd showmapped
    id pool image    snap device
    1  rbd  newinteg -    /dev/rbd1

    $ rbd info newinteg
    rbd image 'newinteg':
            size 70000 MB in 17500 objects
            order 22 (4096 kB objects)
            block_name_prefix: rb.0.1149.74b0dc51
            format: 1

> 3. Also, is this an HDD or SSD setup? If HDD, hope you have journals on SSD.

This is HDD, and below is the output for the disks.
    *-ide
         description: IDE interface
         product: 82371SB PIIX3 IDE [Natoma/Triton II]
         vendor: Intel Corporation
         physical id: 1.1
         bus info: pci@0000:00:01.1
         version: 00
         width: 32 bits
         clock: 33MHz
         capabilities: ide bus_master
         configuration: driver=ata_piix latency=0
         resources: irq:0 ioport:1f0(size=8) ioport:3f6 ioport:170(size=8) ioport:376 ioport:c000(size=16)
    *-scsi
         description: SCSI storage controller
         product: Virtio block device
         vendor: Red Hat, Inc
         physical id: 4
         bus info: pci@0000:00:04.0
         version: 00
         width: 32 bits
         clock: 33MHz
         capabilities: scsi msix bus_master cap_list
         configuration: driver=virtio-pci latency=0
         resources: irq:11 ioport:c080(size=64) memory:f2040000-f2040fff

Regards,
Prabu GJ

On Tue, 16 Jun 2015 21:50:29 +0530, Somnath Roy <somnath@sandisk.com> wrote:

> Okay… I think the extra
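For point 4 in Somnath's checklist above, a quick hedged way to inspect and bump the placement-group count on the default rbd pool might look like this (64 mirrors his suggestion; pg_num can only ever be increased, and pgp_num should follow it):

    ceph osd pool get rbd pg_num
    ceph osd pool set rbd pg_num 64
    ceph osd pool set rbd pgp_num 64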
[ceph-users] best Linux distro for Ceph
Ok - I know this post has the potential to spread to unsavory corners of discussion about the best Linux distro ... blah blah blah ... please, don't let it go there!

I'm seeking some input from people that have been running larger Ceph clusters ... on the order of 100s of physical servers with thousands of OSDs in them. Our primary use case is Object via Swift API integration, plus adding Block store capability both for backing OpenStack/KVM VMs and for general use in various block store scenarios. We'd *like* to look at CephFS, and I'm heartened to see a kernel module (over the FUSE-based client) and a growing user base around it, and I'm hoping "production ready" will soon be stamped on CephFS ...

We currently deploy Ubuntu (primarily Trusty - 14.04) and CentOS 7.1. We've been testing our Ceph clusters on both, but our preference as an organization is CentOS 7.1.1503 (currently). However, I see a lot of noise on the list about needing to track more modern kernel versions, as opposed to the already dated 3.10.x that CentOS 7.1 deploys. Yes, I know RH and the community backport a lot of the newer kernel features to their kernel version ... but ... not everything gets backported.

Can someone out there with real-world, larger-scale Ceph cluster operational experience provide a guideline on the Linux distro they deploy/use, that works well with Ceph and is more in line with keeping up with modern kernel versions ... without crossing the line into the bleeding and painful edge versions?

Thank you ...

~~shane
Re: [ceph-users] Ceph OSD with OCFS2
Okay. You didn't mention anything about your rbd client host config or the cpu cores of the OSD/rbd systems. Some thoughts on what you can do:

1. Considering the pretty lean cpu config you have, I would say check cpu usage on both the OSD and rbd nodes first. If it is already saturated, you are out of luck ☺

2. Quite a bit of write path improvement went in with Hammer and the latest Ceph; hope you are using that code base.

3. I would say put the ceph journal on SSD at least; this should give you a boost.

4. Check the pool pg number; hope this is at least 64 or so.

5. If you are using kernel rbd to map, take the latest krbd code base and build it for your kernel. The reason is that some very important krbd performance fixes went in that unfortunately are probably not yet part of any kernel. That should give you a boost.

6. Make the following changes in your conf file if you are not doing so already and see if anything improves. Make sure you are at least using 'hammer' for this:

    auth_supported = none
    auth_service_required = none
    auth_client_required = none
    auth_cluster_required = none
    debug_lockdep = 0/0
    debug_context = 0/0
    debug_crush = 0/0
    debug_buffer = 0/0
    debug_timer = 0/0
    debug_filer = 0/0
    debug_objecter = 0/0
    debug_rados = 0/0
    debug_rbd = 0/0
    debug_journaler = 0/0
    debug_objectcatcher = 0/0
    debug_client = 0/0
    debug_osd = 0/0
    debug_optracker = 0/0
    debug_objclass = 0/0
    debug_filestore = 0/0
    debug_keyvaluestore = 0/0
    debug_newstore = 0/0
    debug_journal = 0/0
    debug_ms = 0/0
    debug_monc = 0/0
    debug_tp = 0/0
    debug_auth = 0/0
    debug_finisher = 0/0
    debug_heartbeatmap = 0/0
    debug_perfcounter = 0/0
    debug_asok = 0/0
    debug_throttle = 0/0
    debug_mon = 0/0
    debug_paxos = 0/0
    debug_rgw = 0/0
    osd_op_threads = 2
    ms_crc_data = false
    ms_crc_header = false
    osd_op_num_threads_per_shard = 1
    osd_op_num_shards = 12
    osd_enable_op_tracker = false

7. How many copies are you keeping, 1 or 2?

Thanks & Regards
Somnath

From: gjprabu [mailto:gjpr...@zohocorp.com]
Sent: Wednesday, June 17, 2015 12:05 AM
To: Somnath Roy
Cc: Kamala Subramani; ceph-users@lists.ceph.com; Siva Sokkumuthu
Subject: Re: RE: Re: [ceph-users] Ceph OSD with OCFS2

Hi Somnath,

Yes, we will analyze whether there is any bottleneck - is there any useful command to analyze this?

> 1. What is your backend cluster configuration, like how many OSDs, PGs/pool, HW details etc.?

We are using 2 OSDs, and no PGs/pools were created beyond the defaults. Hardware is a physical machine with above 2 GB RAM.

> 2. Is it a single big rbd image you mounted from different hosts and running OCFS2 on top? Please give some details on that front.

Yes, it is a single rbd image we are using on different hosts, running OCFS2 on top:

    $ rbd ls
    newinteg

    $ rbd showmapped
    id pool image    snap device
    1  rbd  newinteg -    /dev/rbd1

    $ rbd info newinteg
    rbd image 'newinteg':
            size 70000 MB in 17500 objects
            order 22 (4096 kB objects)
            block_name_prefix: rb.0.1149.74b0dc51
            format: 1

> 3. Also, is this an HDD or SSD setup? If HDD, hope you have journals on SSD.

This is HDD, and below is the output for the disks.
    *-ide
         description: IDE interface
         product: 82371SB PIIX3 IDE [Natoma/Triton II]
         vendor: Intel Corporation
         physical id: 1.1
         bus info: pci@0000:00:01.1
         version: 00
         width: 32 bits
         clock: 33MHz
         capabilities: ide bus_master
         configuration: driver=ata_piix latency=0
         resources: irq:0 ioport:1f0(size=8) ioport:3f6 ioport:170(size=8) ioport:376 ioport:c000(size=16)
    *-scsi
         description: SCSI storage controller
         product: Virtio block device
         vendor: Red Hat, Inc
         physical id: 4
         bus info: pci@0000:00:04.0
         version: 00
         width: 32 bits
         clock: 33MHz
         capabilities: scsi msix bus_master cap_list
         configuration: driver=virtio-pci latency=0
         resources: irq:11 ioport:c080(size=64) memory:f2040000-f2040fff

Regards,
Prabu GJ

On Tue, 16 Jun 2015 21:50:29 +0530, Somnath Roy <somnath@sandisk.com> wrote:

> Okay… I think the extra layers you have will add some delay, but 1m is probably high (I never tested Ceph on HDD though). We can probably minimize it by optimizing the cluster setup. Please monitor your backend cluster, and even the rbd nodes, to see if anything is a bottleneck there. Also, check if there is any delay between when you issue a request on OCFS2/rbd and when the cluster receives it. Could you please share the following details?
>
> 1. What is your backend
[ceph-users] Explanation for ceph osd set nodown and ceph osd cluster_snap
1) The flags available in "ceph osd set" are pause|noup|nodown|noout|noin|nobackfill|norecover|noscrub|nodeep-scrub|notieragent. I know or can guess most of them (the docs are a "bit" lacking), but with "ceph osd set nodown" I have no idea what it should be used for - to keep hammering a faulty OSD?

2) Looking through the docs I found a reference to "ceph osd cluster_snap":
http://ceph.com/docs/v0.67.9/rados/operations/control/
What does it do? How does that work? Does it really work? ;-) I got a few hits on Google which suggest it might not be something that really works, but it looks like something we could certainly use.

Thanks
Jan
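On the first question: to the best of my knowledge, nodown tells the monitors to ignore "OSD failed" reports, so OSDs are never marked down - mainly useful to stop mass flapping during network trouble, rather than for nursing a faulty OSD. The flags are toggled like this:

    ceph osd set nodown     # monitors ignore osd failure reports
    ceph osd unset nodown   # restore normal down-marking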
Re: [ceph-users] Very chatty MON logs: Is this normal?
On 2015-06-17 18:52:51 +0000, Somnath Roy said:

> This is presently written from log level 1 onwards :-) So, only log level 0 will not log this. Try 'debug_mon = 0/0' in the conf file.

Yeah, once I had sent the mail I realized that the "1" in the log line was the level. I had overlooked that before. However, I'd rather not set the level to 0/0, as that would disable all logging from the MONs.

> Now, I don't have enough knowledge on that part to say whether it is important enough to log at log level 1, sorry :-(

That would indeed be interesting to know. Judging from the sheer amount, I have my doubts at least, because the cluster seems to be running without any issues, so I figure it isn't indicative of an immediate problem. Anyone with a little more definitive knowledge around? Should I create a bug ticket for this?

Cheers,
Daniel
[ceph-users] Expanding a ceph cluster with ansible
I've been working on automating a lot of our Ceph admin tasks lately and am pretty pleased with how the puppet-ceph module has worked for installing packages, managing ceph.conf, and creating the mon nodes. However, I don't like the idea of puppet managing the OSDs. Since we also use ansible in my group, I took a look at ceph-ansible to see how it might be used to complete this task. I see examples for doing a rolling update and for doing an OS migration, but nothing for adding a node or multiple nodes at once. I don't have a problem doing this work, but wanted to check with the community whether anyone has experience using ceph-ansible for this?

After a lot of trial and error I found the following process works well when using ceph-deploy, but it's a lot of steps and can be error prone (especially if you have old cephx keys that haven't been removed yet):

    # Disable backfilling and scrubbing to prevent too many performance
    # impacting tasks from happening at the same time. Maybe adding norecover
    # to this list might be a good idea so only peering happens at first.
    ceph osd set nobackfill
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # Zap the disks to start from a clean slate
    ceph-deploy disk zap dnvrco01-cephosd-025:sd{b..y}

    # Prepare the disks. I found sleeping between adding each disk can help
    # prevent performance problems.
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdh:/dev/sdb; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdi:/dev/sdb; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdj:/dev/sdb; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdk:/dev/sdc; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdl:/dev/sdc; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdm:/dev/sdc; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdn:/dev/sdd; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdo:/dev/sdd; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdp:/dev/sdd; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdq:/dev/sde; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdr:/dev/sde; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sds:/dev/sde; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdt:/dev/sdf; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdu:/dev/sdf; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdv:/dev/sdf; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdw:/dev/sdg; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdx:/dev/sdg; sleep 15
    ceph-deploy osd prepare dnvrco01-cephosd-025:sdy:/dev/sdg; sleep 15

    # Weight in the new OSDs. We set 'osd_crush_initial_weight = 0' to prevent
    # them from being added in during the prepare step. Maybe a longer wait
    # in the last step would make this step unnecessary.
    ceph osd crush reweight osd.450 1.09; sleep 60
    ceph osd crush reweight osd.451 1.09; sleep 60
    ceph osd crush reweight osd.452 1.09; sleep 60
    ceph osd crush reweight osd.453 1.09; sleep 60
    ceph osd crush reweight osd.454 1.09; sleep 60
    ceph osd crush reweight osd.455 1.09; sleep 60
    ceph osd crush reweight osd.456 1.09; sleep 60
    ceph osd crush reweight osd.457 1.09; sleep 60
    ceph osd crush reweight osd.458 1.09; sleep 60
    ceph osd crush reweight osd.459 1.09; sleep 60
    ceph osd crush reweight osd.460 1.09; sleep 60
    ceph osd crush reweight osd.461 1.09; sleep 60
    ceph osd crush reweight osd.462 1.09; sleep 60
    ceph osd crush reweight osd.463 1.09; sleep 60
    ceph osd crush reweight osd.464 1.09; sleep 60
    ceph osd crush reweight osd.465 1.09; sleep 60
    ceph osd crush reweight osd.466 1.09; sleep 60
    ceph osd crush reweight osd.467 1.09; sleep 60

    # Once all the OSDs are added to the cluster, allow the backfill process
    # to begin.
    ceph osd unset nobackfill

    # Then once the cluster is healthy again, re-enable scrubbing
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub
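Until ceph-ansible grows a dedicated play for this, the reweight step at least collapses into a small loop, which cuts down on copy/paste errors - a sketch assuming the same OSD ids and target weight as above:

    for id in $(seq 450 467); do
        ceph osd crush reweight "osd.${id}" 1.09
        sleep 60
    done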
[ceph-users] radosgw did not create auth url for swift
Hi all,

I want to use the swift client to connect to a Ceph cluster. I have done S3 tests on this cluster before, so I followed the guide to create a subuser and used the swift client to test it, but I always get a "404 Not Found" error. How can I create the auth page? Any help will be appreciated.

- 1 Mon (rgw), 3 OSD servers (each server 3 disks)
- Ceph: 0.94.1-13
- swift client: 2.4.0

--------------------start--------------------
test@uclient:~$ swift --debug -V 1.0 -A http://192.168.1.110/auth -U melon:swift -K 'ujZx+foSYDniRzwypqnqNR7hr763zdt+Qe7TpwvR' list
INFO:urllib3.connectionpool:Starting new HTTP connection (1): 192.168.1.110
DEBUG:urllib3.connectionpool:Setting read timeout to <object object at 0x7fa22f0b3090>
DEBUG:urllib3.connectionpool:"GET /auth HTTP/1.1" 404 279
INFO:swiftclient:REQ: curl -i http://192.168.1.110/auth -X GET
INFO:swiftclient:RESP STATUS: 404 Not Found
INFO:swiftclient:RESP HEADERS: [('date', 'Thu, 18 Jun 2015 01:51:58 GMT'), ('content-length', '279'), ('content-type', 'text/html; charset=iso-8859-1'), ('server', 'Apache/2.4.7 (Ubuntu)')]
INFO:swiftclient:RESP BODY:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /auth was not found on this server.</p>
<hr>
<address>Apache/2.4.7 (Ubuntu) Server at 192.168.1.110 Port 80</address>
</body></html>
ERROR:swiftclient:Auth GET failed: http://192.168.1.110/auth 404 Not Found
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 1253, in _retry
    self.url, self.token = self.get_auth()
  File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 1227, in get_auth
    insecure=self.insecure)
  File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 397, in get_auth
    insecure=insecure)
  File "/usr/local/lib/python2.7/dist-packages/swiftclient/client.py", line 278, in get_auth_1_0
    http_status=resp.status, http_reason=resp.reason)
ClientException: Auth GET failed: http://192.168.1.110/auth 404 Not Found (Account not found)
--------------------stop--------------------

Guides:
1) https://ceph.com/docs/v0.78/radosgw/config/
2) http://docs.ceph.com/docs/v0.94/radosgw/config/
3) http://docs.ceph.com/docs/v0.94/radosgw/admin/

Best wishes,
Mika
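One observation, offered as a guess: the 404 above is being served by Apache itself (note the stock Apache error page and the "server: Apache" header), which suggests the request never reached radosgw at all - so the rewrite/FastCGI rule in the rgw vhost is the first thing to check. It may also help to confirm the swift subuser and key exist and to try the /auth/v1.0 form of the auth URL (user id "melon" is taken from the command above; the secret is whatever key create returns):

    radosgw-admin subuser create --uid=melon --subuser=melon:swift --access=full
    radosgw-admin key create --subuser=melon:swift --key-type=swift --gen-secret
    swift -V 1.0 -A http://192.168.1.110/auth/v1.0 -U melon:swift -K '<swift_secret>' list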
Re: [ceph-users] Hardware cache settings recomendation
Thanks for the answer. I made some tests: first I left dwc=enabled and disabled caching on the journal drive. Latency grew from 20ms to 90ms on this drive. Next I enabled cache on the journal drive and disabled all cache on the data drives. Latency on the data drives grew from 30-50ms to 1500-2000ms. The test was made only on one OSD host with a P410i controller, with SATA drives ST1000LM014-1EJ1 for data and an Intel SSDSC2BW12 SSD for the journal.

Regards,
Mateusz

From: Jan Schermer [mailto:j...@schermer.cz]
Sent: Wednesday, June 17, 2015 9:41 AM
To: Mateusz Skała
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hardware cache settings recomendation

Cache on top of the data drives (not journal) will not help in most cases; those writes are already buffered in the OS - so unless your OS is very light on memory and constantly flushing, it will have no effect, and it just adds overhead in case a flush comes. I haven't tested this extensively with Ceph, though.

Cache enabled on the journal drive _could_ help if your SSD is very slow (or if you don't have an SSD for the journal at all), and if it is large enough (more than the active journal size) it could prolong the life of your SSD - depending on how and when the cache starts to flush. I know from experience that the write cache on an Areca controller didn't flush at all until it hit a watermark (50% capacity default or something), and it will be faster than some SSDs on their own. Some SSDs have higher IOPS than the cache can achieve, but you likely won't saturate that with Ceph.

Another thing is the write cache on the drives themselves - I'd leave that disabled (which is probably the default) unless the drive in question has capacitors to flush the cache in case of power failure. Controllers usually have a whitelist of devices that respect flushes, on which the write cache is enabled by default, but in the case of, for example, a Dell PERC you would need to have Dell original drives or enable it manually.

YMMV - I've hit the controller cache IOPS limit in the past with a cheap Dell PERC (H310 was it?) that did ~20K IOPS tops on one SSD drive, while the drive itself did close to 40K. On my SSDs, disabling write cache helps latency (good for journal) but could be troubling for the SSD lifetime. In any case I don't think you would saturate either with Ceph, so I recommend you just test the latency with write cache enabled/disabled on the controller and pick the one that gives the best numbers. This is basically how:
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

The Ceph recommended way is to use everything as passthrough (initiator/target mode) or JBOD (RAID0 with single drives on some controllers), so I'd stick with that.

Jan

On 17 Jun 2015, at 08:01, Mateusz Skała <mateusz.sk...@budikom.net> wrote:

> Yes, all disks are in single-drive RAID 0. Now cache is enabled for all drives; should I disable cache for the SSD drives?
>
> Regards,
> Mateusz

From: Tyler Bishop [mailto:tyler.bis...@beyondhosting.net]
Sent: Thursday, June 11, 2015 7:30 PM
To: Mateusz Skała
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hardware cache settings recomendation

You want write cache to disk, no write cache for SSD. I assume all of your data disks are single-drive RAID 0?

Tyler Bishop
Chief Executive Officer
513-299-7108 x10
tyler.bis...@beyondhosting.net
From: Mateusz Skała <mateusz.sk...@budikom.net>
To: ceph-users@lists.ceph.com
Sent: Saturday, June 6, 2015 4:09:59 AM
Subject: [ceph-users] Hardware cache settings recomendation

Hi,

Please help me with hardware cache settings on controllers for the best Ceph rbd performance. All Ceph hosts have one SSD drive for the journal. We are using 4 different controllers, all with BBU:

- HP Smart Array P400
- HP Smart Array P410i
- Dell PERC 6/i
- Dell PERC H700

I have to set the cache policy. On Dell the settings are:

- Read Policy: Read-Ahead (current), No-Read-Ahead, Adaptive Read-Ahead
- Write Policy: Write-Back (current), Write-Through
- Cache Policy: Cache I/O, Direct I/O (current)
- Disk Cache Policy: Default (current), Enabled, Disabled

On HP controllers:

- Cache Ratio (current: 25% Read / 75% Write)
- Drive Write Cache: Enabled (current), Disabled

And there is one more setting in the LogicalDrive options:

- Caching: Enabled (current), Disabled

Please verify my settings and give me some recommendations.

Best regards,
Mateusz
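The journal-suitability test Jan links to boils down to a single fio run along these lines (the device path is a placeholder - this writes directly to the device, so only run it against a disk holding no data you care about; the --sync=1 4k write pattern mimics the Ceph journal's O_DSYNC writes):

    fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=journal-test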
[ceph-users] osd_scrub_chunk_min/max scrub_sleep?
Hey gang,

Some options are just not documented well… What's up with:

osd_scrub_chunk_min
osd_scrub_chunk_max
osd_scrub_sleep

===
Tu Holmes
tu.hol...@gmail.com
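As far as I understand them (worth verifying against your running config with "ceph daemon osd.N config show"): a scrub walks a PG in chunks of between chunk_min and chunk_max objects at a time, and osd_scrub_sleep inserts a pause between chunks to soften the impact on client I/O. I believe the defaults in this era are:

    osd_scrub_chunk_min = 5
    osd_scrub_chunk_max = 25
    osd_scrub_sleep = 0      # seconds between chunks; 0 = no pause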
Re: [ceph-users] Rename pool by id
Pavel,

Unfortunately there isn't a way to rename a pool using its ID, as I have learned the hard way myself, since I faced the exact same issue a few months ago. It would be a good idea for the developers to also include a way to manipulate (rename, delete, etc.) pools using the ID, which is definitely unique and in my opinion would be error-resistant, or at least less susceptible to errors.

To achieve what you want, try the "rados rmpool" command with the --yes-i-really-really-mean-it flag, which will actually remove the problematic pool, as shown here: http://cephnotes.ksperis.com/blog/2014/10/29/remove-pool-without-name. To be fair and give credit where it's due, this solution was also suggested to me on the IRC channel by debian112 at the time.

Best regards,
George

On Wed, 17 Jun 2015 17:17:55 +0600, pa...@gradient54.ru wrote:

> Hi all, is there any way to rename a pool by ID (pool number)? I have one pool with an empty name; it is not used and I just want to delete it, but can't, because the pool name is required.
>
>     ceph osd lspools
>     0 data,1 metadata,2 rbd,12 ,16 libvirt,
>
> I want to rename this: pool #12.
>
> Thanks, Pavel
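Reconstructing the command from the linked post (rmpool takes the pool name twice as a safety check, and the empty quotes stand in for the pool's empty name - note this removes pool 12 outright rather than renaming it):

    rados rmpool "" "" --yes-i-really-really-mean-it
    ceph osd lspools   # verify pool 12 is gone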
Re: [ceph-users] xattrs vs. omap with radosgw
> We've since merged something that stripes over several small xattrs so that we can keep things inline, but it hasn't been backported to hammer yet. See c6cdb4081e366f471b372102905a1192910ab2da.

Hi Sage: You wrote "yet" - should we earmark it for hammer backport?

Nathan
Re: [ceph-users] Hardware cache settings recomendation
Yes, all disks are in single-drive RAID 0. Now cache is enabled for all drives; should I disable cache for the SSD drives?

Regards,
Mateusz

From: Tyler Bishop [mailto:tyler.bis...@beyondhosting.net]
Sent: Thursday, June 11, 2015 7:30 PM
To: Mateusz Skała
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hardware cache settings recomendation

You want write cache to disk, no write cache for SSD. I assume all of your data disks are single-drive RAID 0?

Tyler Bishop
Chief Executive Officer
513-299-7108 x10
tyler.bis...@beyondhosting.net

From: Mateusz Skała <mateusz.sk...@budikom.net>
To: ceph-users@lists.ceph.com
Sent: Saturday, June 6, 2015 4:09:59 AM
Subject: [ceph-users] Hardware cache settings recomendation

Hi,

Please help me with hardware cache settings on controllers for the best Ceph rbd performance. All Ceph hosts have one SSD drive for the journal. We are using 4 different controllers, all with BBU:

- HP Smart Array P400
- HP Smart Array P410i
- Dell PERC 6/i
- Dell PERC H700

I have to set the cache policy. On Dell the settings are:

- Read Policy: Read-Ahead (current), No-Read-Ahead, Adaptive Read-Ahead
- Write Policy: Write-Back (current), Write-Through
- Cache Policy: Cache I/O, Direct I/O (current)
- Disk Cache Policy: Default (current), Enabled, Disabled

On HP controllers:

- Cache Ratio (current: 25% Read / 75% Write)
- Drive Write Cache: Enabled (current), Disabled

And there is one more setting in the LogicalDrive options:

- Caching: Enabled (current), Disabled

Please verify my settings and give me some recommendations.

Best regards,
Mateusz
Re: [ceph-users] v0.94.2 Hammer released
On Thu, Jun 11, 2015 at 7:34 PM, Sage Weil <sw...@redhat.com> wrote:

> * ceph-objectstore-tool should be in the ceph server package (#11376, Ken Dreyer)

We had a little trouble yum updating from 0.94.1 to 0.94.2:

    file /usr/bin/ceph-objectstore-tool from install of ceph-1:0.94.2-0.el6.x86_64 conflicts with file from package ceph-test-1:0.94.1-0.el6.x86_64

Reported here: http://tracker.ceph.com/issues/12033

Cheers, Dan
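This looks like the usual RPM file-move conflict; a likely workaround (assuming ceph-test 0.94.2 is available in the same repo) is to update both packages in one yum transaction so the file changes owner atomically, or to drop ceph-test if it isn't needed:

    yum update ceph ceph-test
    # or, if ceph-test isn't needed:
    yum remove ceph-test && yum update ceph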
Re: [ceph-users] Hardware cache settings recomendation
Cache on top of the data drives (not journal) will not help in most cases; those writes are already buffered in the OS - so unless your OS is very light on memory and constantly flushing, it will have no effect, and it just adds overhead in case a flush comes. I haven't tested this extensively with Ceph, though.

Cache enabled on the journal drive _could_ help if your SSD is very slow (or if you don't have an SSD for the journal at all), and if it is large enough (more than the active journal size) it could prolong the life of your SSD - depending on how and when the cache starts to flush. I know from experience that the write cache on an Areca controller didn't flush at all until it hit a watermark (50% capacity default or something), and it will be faster than some SSDs on their own. Some SSDs have higher IOPS than the cache can achieve, but you likely won't saturate that with Ceph.

Another thing is the write cache on the drives themselves - I'd leave that disabled (which is probably the default) unless the drive in question has capacitors to flush the cache in case of power failure. Controllers usually have a whitelist of devices that respect flushes, on which the write cache is enabled by default, but in the case of, for example, a Dell PERC you would need to have Dell original drives or enable it manually.

YMMV - I've hit the controller cache IOPS limit in the past with a cheap Dell PERC (H310 was it?) that did ~20K IOPS tops on one SSD drive, while the drive itself did close to 40K. On my SSDs, disabling write cache helps latency (good for journal) but could be troubling for the SSD lifetime. In any case I don't think you would saturate either with Ceph, so I recommend you just test the latency with write cache enabled/disabled on the controller and pick the one that gives the best numbers. This is basically how:
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

The Ceph recommended way is to use everything as passthrough (initiator/target mode) or JBOD (RAID0 with single drives on some controllers), so I'd stick with that.

Jan

On 17 Jun 2015, at 08:01, Mateusz Skała <mateusz.sk...@budikom.net> wrote:

> Yes, all disks are in single-drive RAID 0. Now cache is enabled for all drives; should I disable cache for the SSD drives?
>
> Regards,
> Mateusz

From: Tyler Bishop [mailto:tyler.bis...@beyondhosting.net]
Sent: Thursday, June 11, 2015 7:30 PM
To: Mateusz Skała
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hardware cache settings recomendation

You want write cache to disk, no write cache for SSD. I assume all of your data disks are single-drive RAID 0?

Tyler Bishop
Chief Executive Officer
513-299-7108 x10
tyler.bis...@beyondhosting.net

From: Mateusz Skała <mateusz.sk...@budikom.net>
To: ceph-users@lists.ceph.com
Sent: Saturday, June 6, 2015 4:09:59 AM
Subject: [ceph-users] Hardware cache settings recomendation

Hi,

Please help me with hardware cache settings on controllers for the best Ceph rbd performance. All Ceph hosts have one SSD drive for the journal.
We are using 4 different controllers, all with BBU:

- HP Smart Array P400
- HP Smart Array P410i
- Dell PERC 6/i
- Dell PERC H700

I have to set the cache policy. On Dell the settings are:

- Read Policy: Read-Ahead (current), No-Read-Ahead, Adaptive Read-Ahead
- Write Policy: Write-Back (current), Write-Through
- Cache Policy: Cache I/O, Direct I/O (current)
- Disk Cache Policy: Default (current), Enabled, Disabled

On HP controllers:

- Cache Ratio (current: 25% Read / 75% Write)
- Drive Write Cache: Enabled (current), Disabled

And there is one more setting in the LogicalDrive options:

- Caching: Enabled (current), Disabled

Please verify my settings and give me some recommendations.

Best regards,
Mateusz
Re: [ceph-users] CephFS: 'ls -alR' performance terrible unless Linux cache flushed
I have done some quick tests with FUSE too: it seems to me that, both with the old and with the new kernel, FUSE is approx. five times slower than the kernel driver for both reading files and getting stats. I don't know whether it is just me or if that is expected.

On Wed, Jun 17, 2015 at 2:56 AM, Francois Lafont <flafdiv...@free.fr> wrote:

> Hi,
>
> On 16/06/2015 18:46, negillen negillen wrote:
>
>> Fixed! At least it looks like fixed.
>
> That's cool for you. ;)
>
>> It seems that after migrating every node (both servers and clients) from kernel 3.10.80-1 to 4.0.4-1 the issue disappeared. Now I get decent speeds both for reading files and for getting stats from every node.
>
> It seems to me that an interesting test could be to leave the old kernel on your client nodes (ie 3.10.80-1), use ceph-fuse instead of the ceph kernel module, and test whether you get decent speeds too.
>
> Bye.
>
> --
> François Lafont
Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
If necessary, there are RPM files for CentOS 7:

- gperftools.spec: https://drive.google.com/file/d/0BxoNLVWxzOJWaVVmWTA3Z18zbUE/edit?usp=drive_web
- pprof-2.4-1.el7.centos.noarch.rpm: https://drive.google.com/file/d/0BxoNLVWxzOJWRmQ2ZEt6a1pnSVk/edit?usp=drive_web
- gperftools-libs-2.4-1.el7.centos.x86_64.rpm: https://drive.google.com/file/d/0BxoNLVWxzOJWcVByNUZHWWJqRXc/edit?usp=drive_web
- gperftools-devel-2.4-1.el7.centos.x86_64.rpm: https://drive.google.com/file/d/0BxoNLVWxzOJWYTUzQTNha3J3NEU/edit?usp=drive_web
- gperftools-debuginfo-2.4-1.el7.centos.x86_64.rpm: https://drive.google.com/file/d/0BxoNLVWxzOJWVzBic043YUk2LWM/edit?usp=drive_web
- gperftools-2.4-1.el7.centos.x86_64.rpm: https://drive.google.com/file/d/0BxoNLVWxzOJWNm81QWdQYU9ZaG8/edit?usp=drive_web

2015-06-17 8:01 GMT+03:00 Alexandre DERUMIER <aderum...@odiso.com>:

> Hi, I finally fixed it with tcmalloc, with
>
>     TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456
>     LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4
>
> set in the qemu environment. I got almost the same result as jemalloc in this case, maybe a little bit faster. Here are the iops results for 1 qemu vm with an iothread per disk (iodepth=32, 4k randread, nocache):
>
>     qemu randread 4k nocache, libc6 (iops):
>       1 disk    29052
>       2 disks   55878
>       4 disks  127899
>       8 disks  240566
>       15 disks 269976
>
>     qemu randread 4k nocache, jemalloc (iops):
>       1 disk    41278
>       2 disks   75781
>       4 disks  195351
>       8 disks  294241
>       15 disks 298199
>
>     qemu randread 4k nocache, tcmalloc 16M cache (iops):
>       1 disk    37911
>       2 disks   67698
>       4 disks   41076
>       8 disks   43312
>       15 disks  37569
>
>     qemu randread 4k nocache, tcmalloc patched 256M (iops):
>       1 disk no-iothread
>       1 disk    42160
>       2 disks   83135
>       4 disks  194591
>       8 disks  306038
>       15 disks 302278
>
> ----- Original message -----
> From: "aderumier" <aderum...@odiso.com>
> To: "Mark Nelson" <mnel...@redhat.com>
> Cc: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Tuesday 16 June 2015 20:27:54
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
>>> I forgot to ask, is this with the patched version of tcmalloc that theoretically fixes the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES issue?
>
> Yes, the patched version of tcmalloc, but also the latest version from the gperftools git. (I'm talking about qemu here, not OSDs.) I have tried increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES, but it doesn't help. For the OSDs, increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES does help. (Benchmarks are still running; I'm trying to overload them as much as possible.)
>
> ----- Original message -----
> From: "Mark Nelson" <mnel...@redhat.com>
> To: "ceph-users" <ceph-users@lists.ceph.com>
> Sent: Tuesday 16 June 2015 19:04:27
> Subject: Re: [ceph-users] rbd_cache, limiting read on high iops around 40k
>
> I forgot to ask, is this with the patched version of tcmalloc that theoretically fixes the TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES issue?
>
> Mark
>
> On 06/16/2015 11:46 AM, Mark Nelson wrote:
>> Hi Alexandre, Excellent find! Have you also informed the QEMU developers of your discovery?
>> Mark
>
> On 06/16/2015 11:38 AM, Alexandre DERUMIER wrote:
>> Hi, some news about qemu with tcmalloc vs jemalloc. I'm testing with multiple disks (with iothreads) in 1 qemu guest, and while tcmalloc is a little faster than jemalloc, I have hit the tcmalloc::ThreadCache::ReleaseToCentralCache bug a lot of times. Increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES doesn't help. With multiple disks I'm around 200k iops with tcmalloc (before hitting the bug) and 350k iops with jemalloc. The problem is that when I hit the malloc bug, I'm around 4000-10000 iops, and the only way to fix it is to restart qemu ...
----- Original message -----
From: "pushpesh sharma" <pushpesh@gmail.com>
To: "aderumier" <aderum...@odiso.com>
Cc: "Somnath Roy" <somnath@sandisk.com>, "Irek Fasikhov" <malm...@gmail.com>, "ceph-devel" <ceph-de...@vger.kernel.org>, "ceph-users" <ceph-users@lists.ceph.com>
Sent: Friday 12 June 2015 08:58:21
Subject: Re: rbd_cache, limiting read on high iops around 40k

Thanks, posted the question on the openstack list. Hopefully I will get some expert opinion.

On Fri, Jun 12, 2015 at 11:33 AM, Alexandre DERUMIER <aderum...@odiso.com> wrote:

> Hi, here is a libvirt XML sample from the libvirt source (you need to define the number of iothreads, then assign them to disks). I don't use OpenStack, so I really don't know how it works with it.
>
>     <domain type='qemu'>
>       <name>QEMUGuest1</name>
>       <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
>       <memory unit='KiB'>219136</memory>
>       <currentMemory unit='KiB'>219136</currentMemory>
>       <vcpu placement='static'>2</vcpu>
>       <iothreads>2</iothreads>
>       <os>
>         <type arch='i686' machine='pc'>hvm</type>
>         <boot dev='hd'/>
>       </os>
>       <clock offset='utc'/>
>       <on_poweroff>destroy</on_poweroff>
>       <on_reboot>restart</on_reboot>
>       <on_crash>destroy</on_crash>
>       <devices>
>         <emulator>/usr/bin/qemu</emulator>
>         <disk type='file' device='disk'>
>           <driver name='qemu' type='raw' iothread='1'/>
>           <source
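For anyone wanting to reproduce the allocator workaround by hand, a sketch of launching qemu with the patched-tcmalloc settings from the message above (the binary name and remaining arguments are placeholders for your usual invocation):

    export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=268435456
    LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 qemu-system-x86_64 <your usual args>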
[ceph-users] 10d
Hi,

After upgrading to 0.94.2 yesterday on our test cluster, we've had 3 PGs go inconsistent.

First, immediately after we updated the OSDs, PG 34.10d went inconsistent:

    2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 : cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones, 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136 bytes, 0/0 hit_set_archive bytes.

Second, an hour later, 55.10d went inconsistent:

    2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 : cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0 bytes, 0/0 hit_set_archive bytes.

Then last night 36.10d suffered the same fate:

    2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 : cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects, 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 24126649216/24130843520 bytes, 0/0 hit_set_archive bytes.

In all cases, one object is missing. In all cases, the PG id is 10d. Is this an epic coincidence, or could something else be going on here?

Best Regards,
Dan
Re: [ceph-users] SSD LifeTime for Monitors
On Wed, Jun 17, 2015 at 10:18 AM, Stefan Priebe - Profihost AG <s.pri...@profihost.ag> wrote:

> Hi,
> Does anybody know how much data gets written by the monitors? I was using some cheaper SSDs for the monitors and was wondering why they had already written 80 TB after 8 months.

3.8MB/s? That's a little more than I would naively expect, but LevelDB is probably doubling the total data written (at least), so that brings it down to 1.9MB/s of real data. How big are your PGMap and OSDMap? Do you have any logging being written to those SSDs? Etc. ;)
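A couple of quick checks along these lines, sketched with a placeholder mon id "a" and default paths (the store location assumes a LevelDB-backed monitor):

    du -sh /var/lib/ceph/mon/ceph-a/store.db   # current size of the mon's leveldb store
    ceph daemon mon.a perf dump                # mon counters via the admin socket
    ceph -s                                    # pgmap version churn gives a feel for the map update rate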
Re: [ceph-users] 10d
On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster <d...@vanderster.com> wrote:

> Hi,
> After upgrading to 0.94.2 yesterday on our test cluster, we've had 3 PGs go inconsistent. [...]
> In all cases, one object is missing. In all cases, the PG id is 10d. Is this an epic coincidence, or could something else be going on here?

I'm betting on something else. What OSDs is each PG mapped to? It looks like each of them is missing one object on some of the OSDs; what are the objects?

-Greg
Re: [ceph-users] xattrs vs. omap with radosgw
On Wed, Jun 17, 2015 at 1:02 PM, Nathan Cutler <ncut...@suse.cz> wrote:

>> We've since merged something that stripes over several small xattrs so that we can keep things inline, but it hasn't been backported to hammer yet. See c6cdb4081e366f471b372102905a1192910ab2da.
>
> Hi Sage: You wrote "yet" - should we earmark it for hammer backport?

I'm guessing https://github.com/ceph/ceph/pull/4973 is the backport for hammer (issue: http://tracker.ceph.com/issues/11981).

Regards
Abhishek
[ceph-users] SSD LifeTime for Monitors
Hi,

Does anybody know how much data gets written by the monitors? I was using some cheaper SSDs for the monitors and was wondering why they had already written 80 TB after 8 months.

Stefan
[ceph-users] rbd performance issue - can't find bottleneck
Hi,

We've been doing some testing of Ceph hammer (0.94.2), but the performance is very slow and we can't find what's causing the problem.

Initially we started with four nodes with 10 OSDs total. The drives we used were enterprise SATA drives, and on top of that we used SSDs as flashcache devices for the SATA drives and for storing the OSD journals. The local tests on each of the four nodes give the results you'd expect:

- ~500MB/s seq writes and reads from the SSDs
- ~40k iops random reads from the SSDs
- ~200MB/s seq writes and reads from the SATA drives
- ~600 iops random reads from the SATA drives

...but when we tested this setup from a client we got rather slow results. So we tried to find a bottleneck and tested the network by connecting the client to our nodes via NFS - and performance via NFS is as expected (similar results to the local tests, only slightly slower).

So we reconfigured Ceph to not use the SATA drives and just set up OSDs on the SSD drives (we wanted to test whether maybe this is a flashcache problem) ... but to no avail; the results of rbd I/O tests with two OSD nodes set up on SSD drives are:

- ~60MB/s seq writes
- ~100MB/s seq reads
- ~2-3k iops random reads

The client is an rbd volume mounted on a Linux Ubuntu box. All the servers (OSD nodes and the client) run Ubuntu Server 14.04. We tried switching to CentOS 7, but the results are the same.

Here are some technical details about our setup. Four identical OSD nodes:

- E5-1630 CPU
- 32 GB RAM
- Mellanox MT27520 56Gbps network cards
- SATA controller LSI Logic SAS3008

Storage nodes are connected to SuperMicro chassis: 847E1C-R1K28JBOD. Four monitors (one on each node). We do not use CephFS, so we do not run ceph-mds.

During the tests we monitored all the OSD nodes and the client - we haven't seen any problems on any of the hosts: load was low, there were no CPU waits, no abnormal system interrupts, no I/O problems on the disks. None of the systems seemed to sweat at all, and yet the results are rather dissatisfying. We're kind of lost; any help will be appreciated.

Cheers,
J

--
Jacek Jarosiewicz
IT Systems Administrator
SUPERMEDIA Sp. z o.o., Warsaw
http://www.supermedia.pl
Re: [ceph-users] 10d
On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum <g...@gregs42.com> wrote:

> I'm betting on something else. What OSDs is each PG mapped to? It looks like each of them is missing one object on some of the OSDs; what are the objects?

    34.10d: [52,202,218]
    55.10d: [303,231,65]
    36.10d: [30,171,69]

So no common OSDs. I've already repaired all of these PGs, and the logs have nothing interesting, so I can't say more about the objects.

Cheers,
Dan
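For the record, the usual sequence for chasing one of these down looks like the sketch below (PG id taken from above; repair overwrites the divergent copy from the authoritative one, so it's worth confirming the mapping and re-scrubbing first):

    ceph pg map 34.10d         # which OSDs hold the PG
    ceph pg deep-scrub 34.10d  # re-run the deep scrub
    ceph pg repair 34.10d      # then repair the inconsistency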
Re: [ceph-users] rbd performance issue - can't find bottleneck
On 06/17/2015 04:10 AM, Jacek Jarosiewicz wrote:

> Hi, We've been doing some testing of Ceph hammer (0.94.2), but the performance is very slow and we can't find what's causing the problem. [...] The results of rbd I/O tests with two OSD nodes set up on SSD drives are:
> ~60MB/s seq writes
> ~100MB/s seq reads
> ~2-3k iops random reads

Is this per SSD or aggregate?

> The client is an rbd volume mounted on a Linux Ubuntu box. All the servers (OSD nodes and the client) run Ubuntu Server 14.04. We tried switching to CentOS 7, but the results are the same.

Is this kernel RBD or a VM using QEMU/KVM? You might want to try fio with the librbd engine and see if you get the same results. Also, radosbench isn't exactly analogous, but you might try some large sequential write / sequential read tests just as a sanity check.

> Four identical OSD nodes: E5-1630 CPU, 32 GB RAM, Mellanox MT27520 56Gbps network cards, SATA controller LSI Logic SAS3008

Specs look fine.

> Storage nodes are connected to SuperMicro chassis: 847E1C-R1K28JBOD

Is that where the SSDs live? I'm not a fan of such heavy expander over-subscription, but if you are getting good results outside of Ceph I'm guessing it's something else.

> Four monitors (one on each node). We do not use CephFS, so we do not run ceph-mds.

You'll want to go down to 3 or up to 5. Even numbers of monitors don't really help you in any way (and can actually hurt). I'd suggest 3.

> During the tests we monitored all the OSD nodes and the client [...] and yet the results are rather dissatisfying. We're kind of lost; any help will be appreciated.

You didn't mention the brand/model of the SSDs. Especially for writes this is important, as Ceph journal writes are O_DSYNC. Drives that have proper power-loss protection often can ignore ATA_CMD_FLUSH and do these very quickly, while other drives may need to flush to the flash cells. Also, keep in mind for writes that if you have journals on the SSDs and 3X replication, you'll be doing 6 writes for every client write.

For reads and read IOPS on SSDs, you might try disabling in-memory logging and Ceph authentication. You might be interested in some testing we did on a variety of SSDs here:

http://www.spinics.net/lists/ceph-users/msg15733.html
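Hedged examples of the two sanity checks Mark suggests (pool and image names are placeholders; the rbd ioengine requires an fio build with librbd support, and the sequential read needs a prior write left in place with --no-cleanup):

    # fio against librbd directly, bypassing the krbd mount:
    fio --ioengine=rbd --clientname=admin --pool=rbd --rbdname=test10g \
        --rw=randread --bs=4k --iodepth=32 --numjobs=1 --direct=1 \
        --name=librbd-randread

    # large sequential write, then read, with rados bench:
    rados bench -p rbd 60 write --no-cleanup
    rados bench -p rbd 60 seq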
Re: [ceph-users] rbd performance issue - can't find bottleneck
Hi,

Can you post your ceph.conf? Which tools do you use for benchmarking? Which block size, iodepth, and number of clients/rbd volumes do you use? Is it with the krbd kernel driver? (I have seen some bad performance with kernel 3.16, but at a much higher rate (100k iops).) Is it with ethernet switches, or IP over InfiniBand?

Your results seem quite low in any case. I'm also using Mellanox ethernet switches (10GbE) and the SAS3008 (Dell R630), and I can reach around 250k iops randread 4K with 1 OSD (at 80% usage of 2x 10 cores @ 3.1GHz).

Here is my ceph.conf:

    [global]
    fsid = ...
    public_network = ...
    mon_initial_members = ...
    mon_host = ...
    auth_cluster_required = none
    auth_service_required = none
    auth_client_required = none
    filestore_xattr_use_omap = true
    osd_pool_default_min_size = 1
    debug_lockdep = 0/0
    debug_context = 0/0
    debug_crush = 0/0
    debug_buffer = 0/0
    debug_timer = 0/0
    debug_journaler = 0/0
    debug_osd = 0/0
    debug_optracker = 0/0
    debug_objclass = 0/0
    debug_filestore = 0/0
    debug_journal = 0/0
    debug_ms = 0/0
    debug_monc = 0/0
    debug_tp = 0/0
    debug_auth = 0/0
    debug_finisher = 0/0
    debug_heartbeatmap = 0/0
    debug_perfcounter = 0/0
    debug_asok = 0/0
    debug_throttle = 0/0
    osd_op_threads = 5
    filestore_op_threads = 4
    osd_op_num_threads_per_shard = 2
    osd_op_num_shards = 10
    filestore_fd_cache_size = 64
    filestore_fd_cache_shards = 32
    ms_nocrc = true
    ms_dispatch_throttle_bytes = 0
    cephx_sign_messages = false
    cephx_require_signatures = false
    throttler_perf_counter = false
    ms_crc_header = false
    ms_crc_data = false

    [osd]
    osd_client_message_size_cap = 0
    osd_client_message_cap = 0
    osd_enable_op_tracker = false

(The main boosts are disabling cephx auth and debug logging, and increasing the thread sharding.)

----- Original message -----
From: Jacek Jarosiewicz <jjarosiew...@supermedia.pl>
To: ceph-users <ceph-users@lists.ceph.com>
Sent: Wednesday 17 June 2015 11:10:26
Subject: [ceph-users] rbd performance issue - can't find bottleneck

> Hi, We've been doing some testing of Ceph hammer (0.94.2), but the performance is very slow and we can't find what's causing the problem. [...]
We do not use CephFS so we do not run ceph-mds. During the tests we were monitoring all osd nodes and the client - we haven't seen any problems on none of the hosts - load was low, there were no cpu waits, no abnormal system interrupts, no i/o problems on the disks - all the systems seemed to not sweat at all and yet the results are rather dissatisfying.. we're kinda lost, any help will be appreciated. Cheers, J -- Jacek Jarosiewicz Administrator Systemów Informatycznych SUPERMEDIA Sp. z o.o. z siedzibą w Warszawie ul. Senatorska 13/15, 00-075 Warszawa Sąd Rejonowy dla m.st.Warszawy, XII Wydział Gospodarczy Krajowego Rejestru Sądowego, nr KRS 029537; kapitał zakładowy 42.756.000 zł NIP: 957-05-49-503 Adres korespondencyjny: ul. Jubilerska 10, 04-190 Warszawa SUPERMEDIA - http://www.supermedia.pl dostep do internetu - hosting - kolokacja - lacza - telefonia ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com
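Most of those debug settings can also be applied to a running cluster without a restart, via injectargs. A minimal sketch, assuming a hammer-era cluster (the auth settings, by contrast, require a config change and daemon restarts; mon.cf01 is taken from the ceph.conf quoted in the next message):

# Silence the chattiest debug subsystems on all OSDs at runtime;
# same effect as the debug_* = 0/0 lines above, without a restart
ceph tell osd.* injectargs '--debug_ms 0/0 --debug_osd 0/0 --debug_filestore 0/0 --debug_journal 0/0'

# Monitors are addressed individually by name
ceph tell mon.cf01 injectargs '--debug_mon 0/0 --debug_ms 0/0'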
Re: [ceph-users] rbd performance issue - can't find bottleneck
On 06/17/2015 03:38 PM, Alexandre DERUMIER wrote:
> Hi, can you post your ceph.conf?

Sure:

[global]
fsid = e96fdc70-4f9c-4c12-aae8-63dd7c64c876
mon initial members = cf01,cf02
mon host = 10.4.10.211,10.4.10.212
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
filestore xattr use omap = true
public network = 10.4.10.0/24
#cluster network = 192.168.10.0/24
osd journal size = 10240
#journal dio = false
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 512
osd pool default pgp num = 512
osd crush chooseleaf type = 1

[mon.cf01]
host = cf01
mon addr = 10.4.10.211:6789

[mon.cf02]
host = cf02
mon addr = 10.4.10.212:6789

[osd.0]
host = cf01

[osd.1]
host = cf02

> Which tools do you use for benchmark? Which block size, iodepth, number of clients/rbd volumes do you use?

I use fio for random reads and dd for sequential reads and writes. Block size is 4k (the fs on the OSDs is XFS). I used iodepths of 1, 4, 16 and 32 - the more I/O in the queue, the worse the performance. The results I posted in my message are from an fio command run like this:

fio --name=randread --numjobs=1 --rw=randread --bs=4k --size=10G --filename=test10g --direct=1

> Is it with the krbd kernel driver? (I have seen some bad performance with kernel 3.16, but at a much higher rate (100k iops).) Is it with Ethernet switches? Or IP over InfiniBand?

Kernel driver, kernel version 3.10.0-229.4.2.el7.x86_64 (the last tests were on CentOS 7.1; when we used Ubuntu the kernel version was 3.13.0-53-generic). We use Ethernet switches (Mellanox MSX1012). The switches are configured with MLAG and we use Mellanox dual-port 56Gbps cards with bond interfaces configured as round-robin.

> your results seem quite low anyway.

Yes.. :(

> I'm also using Ethernet Mellanox switches (10GbE), sas3008 (Dell R630), and I can reach around 250k iops randread 4K with 1 OSD (with 80% usage of 2x10 cores at 3.1GHz).

[quoted ceph.conf snipped - see Alexandre's message above]

I'll try your suggested config and let you know, thanks!

J
--
Jacek Jarosiewicz
Administrator Systemów Informatycznych
SUPERMEDIA Sp. z o.o.
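Since radosbench came up earlier in the thread, this is roughly what such a sanity check looks like. A sketch, assuming the pool is named rbd; -b 4096 matches the 4k block size used in the fio tests above:

# Write 4k objects for 60 seconds and keep them for the read tests
rados bench -p rbd 60 write -b 4096 --no-cleanup

# Sequential and random reads against the objects written above
rados bench -p rbd 60 seq
rados bench -p rbd 60 rand

# Remove the benchmark objects afterwards
rados -p rbd cleanup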
Re: [ceph-users] rbd performance issue - can't find bottleneck
On Wed, 17 Jun 2015 16:03:17 +0200 Jacek Jarosiewicz wrote:

> On 06/17/2015 03:34 PM, Mark Nelson wrote:
>> On 06/17/2015 04:10 AM, Jacek Jarosiewicz wrote:
>>> Hi, [ cut ]
>>> ~60MB/s seq writes
>>> ~100MB/s seq reads
>>> ~2-3k iops random reads
>> Is this per SSD or aggregate?
> Aggregate (if I understand you correctly). This is what I see when I run tests on the client - a mapped and mounted rbd.
>>> The client is an rbd mounted on a Linux Ubuntu box. All the servers (OSD nodes and the client) are running Ubuntu Server 14.04. We tried to switch to CentOS 7 - but the results are the same.
>> Is this kernel RBD or a VM using QEMU/KVM? You might want to try fio with the librbd engine and see if you get the same results. Also, radosbench isn't exactly analogous, but you might try some large sequential write / sequential read tests just as a sanity check.
> This is kernel rbd - testing performance on VMs will be the next step. I've tried fio with librbd, but the results were similar. I'll run the radosbench tests and post my results.

Kernel tends to be less than stellar, but that's probably not your main problem.

>>> Here are some technical details about our setup. Four identical OSD nodes: E5-1630 CPU, 32 GB RAM, Mellanox MT27520 56Gbps network cards, LSI Logic SAS3008 SATA controller.
>> Specs look fine.
>>> Storage nodes are connected to a SuperMicro chassis: 847E1C-R1K28JBOD
>> Is that where the SSDs live? I'm not a fan of such heavy expander over-subscription, but if you are getting good results outside of Ceph I'm guessing it's something else.
> No, the SSDs are connected to the integrated Intel SATA controller (C610/X99). The only disks that reside in the SuperMicro chassis are the SATA drives, and in the last tests I don't use them - the results I gave are on SSDs only (one SSD serves as the OSD and the journal is on another SSD).
>>> Four monitors (one on each node). We do not use CephFS so we do not run ceph-mds.
>> You'll want to go down to 3 or up to 5. Even numbers of monitors don't really help you in any way (and can actually hurt). I'd suggest 3.
> OK, will do that, thanks!
>> You didn't mention the brand/model of SSDs. Especially for writes this is important, as ceph journal writes are O_DSYNC. Drives that have proper power loss protection can often ignore ATA_CMD_FLUSH and do these writes very quickly, while other drives may need to flush to the flash cells. Also, keep in mind for writes that if you have journals on the SSDs and 3X replication, you'll be doing 6 writes for every client write.
> SSDs are INTEL SSDSC2BW240A4

Intel, they make great SSDs - and horrid product numbers in SMART to go with their differently marketed/named devices. Anyway, those are likely your problem, see:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-November/035695.html
Or any Google result for "Ceph Intel 530", probably.

When you ran those tests, did you use atop or iostat to watch the SSD utilization?

Christian

> The rbd pool is set to have min_size 1 and size 2.
>> For reads and read IOPs on SSDs, you might try disabling in-memory logging and ceph authentication. You might be interested in some testing we did on a variety of SSDs here: http://www.spinics.net/lists/ceph-users/msg15733.html
> Will read up on that too, thanks!
> J

--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
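The O_DSYNC behaviour Christian describes is easy to measure directly. A sketch of the usual journal-style fio test - destructive if pointed at a raw device, so use a spare partition; /dev/sdX is a placeholder:

# Small sequential writes, one at a time, synced after each write
# (O_SYNC approximates the journal's O_DSYNC traffic). SSDs with
# power loss protection sustain thousands of iops here; consumer
# drives that flush their cache on every sync often drop to a few
# hundred.
fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
    --rw=write --bs=4k --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based --group_reporting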
[ceph-users] Accessing Ceph from Spark
Is it possible to access Ceph from Spark, as is mentioned here for OpenStack Swift? https://spark.apache.org/docs/latest/storage-openstack-swift.html

Thanks for the help.

Milan Sladky
Re: [ceph-users] rbd performance issue - can't find bottleneck
On 06/17/2015 03:34 PM, Mark Nelson wrote:
> On 06/17/2015 04:10 AM, Jacek Jarosiewicz wrote:
>> Hi, [ cut ]
>> ~60MB/s seq writes
>> ~100MB/s seq reads
>> ~2-3k iops random reads
> Is this per SSD or aggregate?

Aggregate (if I understand you correctly). This is what I see when I run tests on the client - a mapped and mounted rbd.

>> The client is an rbd mounted on a Linux Ubuntu box. All the servers (OSD nodes and the client) are running Ubuntu Server 14.04. We tried to switch to CentOS 7 - but the results are the same.
> Is this kernel RBD or a VM using QEMU/KVM? You might want to try fio with the librbd engine and see if you get the same results. Also, radosbench isn't exactly analogous, but you might try some large sequential write / sequential read tests just as a sanity check.

This is kernel rbd - testing performance on VMs will be the next step. I've tried fio with librbd, but the results were similar. I'll run the radosbench tests and post my results.

>> Here are some technical details about our setup. Four identical OSD nodes: E5-1630 CPU, 32 GB RAM, Mellanox MT27520 56Gbps network cards, LSI Logic SAS3008 SATA controller.
> Specs look fine.
>> Storage nodes are connected to a SuperMicro chassis: 847E1C-R1K28JBOD
> Is that where the SSDs live? I'm not a fan of such heavy expander over-subscription, but if you are getting good results outside of Ceph I'm guessing it's something else.

No, the SSDs are connected to the integrated Intel SATA controller (C610/X99). The only disks that reside in the SuperMicro chassis are the SATA drives, and in the last tests I don't use them - the results I gave are on SSDs only (one SSD serves as the OSD and the journal is on another SSD).

>> Four monitors (one on each node). We do not use CephFS so we do not run ceph-mds.
> You'll want to go down to 3 or up to 5. Even numbers of monitors don't really help you in any way (and can actually hurt). I'd suggest 3.

OK, will do that, thanks!

> You didn't mention the brand/model of SSDs. Especially for writes this is important, as ceph journal writes are O_DSYNC. Drives that have proper power loss protection can often ignore ATA_CMD_FLUSH and do these writes very quickly, while other drives may need to flush to the flash cells. Also, keep in mind for writes that if you have journals on the SSDs and 3X replication, you'll be doing 6 writes for every client write.

SSDs are INTEL SSDSC2BW240A4. The rbd pool is set to have min_size 1 and size 2.

> For reads and read IOPs on SSDs, you might try disabling in-memory logging and ceph authentication. You might be interested in some testing we did on a variety of SSDs here: http://www.spinics.net/lists/ceph-users/msg15733.html

Will read up on that too, thanks!

J
--
Jacek Jarosiewicz
Administrator Systemów Informatycznych
SUPERMEDIA Sp. z o.o.
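For reference, a librbd-engine fio run (bypassing krbd entirely) looks roughly like this. A sketch only - it assumes fio was built with rbd support, the admin cephx client is used, and an image named test10g exists in the rbd pool:

# fio talks to the cluster through librbd, so no "rbd map" is needed
fio --name=rbd-randread --ioengine=rbd --clientname=admin \
    --pool=rbd --rbdname=test10g \
    --rw=randread --bs=4k --iodepth=32 --runtime=60 --time_based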
Re: [ceph-users] xattrs vs. omap with radosgw
On Wed, 17 Jun 2015, Nathan Cutler wrote:
>> We've since merged something that stripes over several small xattrs so that we can keep things inline, but it hasn't been backported to hammer yet. See c6cdb4081e366f471b372102905a1192910ab2da.
> Hi Sage: You wrote "yet" - should we earmark it for hammer backport?

Yes, please!

sage
Re: [ceph-users] Accessing Ceph from Spark
On Wed, Jun 17, 2015 at 2:58 PM, Milan Sladky milan.sla...@outlook.com wrote:
> Is it possible to access Ceph from Spark as it is mentioned here for Openstack Swift? https://spark.apache.org/docs/latest/storage-openstack-swift.html

Depends on what you're trying to do. It's possible that the Swift bindings described there will just work with Ceph (somebody else will have to answer that). If you're interested in CephFS, it has bindings for Hadoop, and I believe Spark works with that.

-Greg
Re: [ceph-users] rbd performance issue - can't find bottleneck
On 06/17/2015 09:03 AM, Jacek Jarosiewicz wrote:
> On 06/17/2015 03:34 PM, Mark Nelson wrote:
> [snip - quoted exchange unchanged, see the earlier messages in this thread]
>> You didn't mention the brand/model of SSDs. Especially for writes this is important, as ceph journal writes are O_DSYNC. [...]
> SSDs are INTEL SSDSC2BW240A4

Ah, if I'm not mistaken that's the Intel 530, right? You'll want to see this thread by Stefan Priebe:

https://www.mail-archive.com/ceph-users@lists.ceph.com/msg05667.html

In fact, it was the difference in Intel 520 and Intel 530 performance that triggered many of the investigations that various folks have made into SSD flushing behavior on ATA_CMD_FLUSH. The gist of it is that the 520 is very fast but probably not safe. The 530 is safe but not fast. The DC S3700 (and similar drives with supercapacitors) are thought to be both fast and safe (though some drives, like the Crucial M500 and later, misrepresented their power loss protection, so you have to be very careful!).

> Will read up on that too, thanks!
> J
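A quick way to see the flush penalty Mark describes, without fio, is dd with dsync. A sketch, assuming the SSD is mounted at /mnt/ssd (a placeholder path); throughput collapses on drives that must flush their cache on every sync:

# 10000 x 4k writes, each followed by a data sync (O_DIRECT + O_DSYNC)
dd if=/dev/zero of=/mnt/ssd/ddtest bs=4k count=10000 oflag=direct,dsync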
Re: [ceph-users] Erasure Coded Pools and PGs
Hi,

On 17/06/2015 18:04, Garg, Pankaj wrote:
> Hi, I have 5 OSD servers, with a total of 45 OSDs in my cluster. I am trying out erasure coding with different k and m values. I always seem to get warnings about degraded and undersized PGs whenever I create a profile and create a pool based on that profile. I have profiles with k and m value pairs (2,1), (3,3) and (5,3).

By default the crush ruleset for an erasure coded pool needs as many hosts as k+m, i.e. you need 6 hosts for (3,3) and 8 for (5,3). You can change this by setting the failure domain when creating the erasure code profile, as documented at http://docs.ceph.com/docs/master/rados/operations/erasure-code-jerasure/

> What would be appropriate PG values? I have tried from as low as 12 to 1024 and always get the degraded and undersized PGs. This is quite confusing.

If the problem is different, it would be great if you could file a bug report with details. The ceph report command will output all the relevant information.

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
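Concretely, relaxing the failure domain looks like this. A sketch using hammer-era syntax - the profile and pool names are examples, and on recent releases ruleset-failure-domain is spelled crush-failure-domain:

# A (5,3) profile that spreads shards across OSDs instead of hosts,
# so 8 OSDs suffice rather than 8 hosts
ceph osd erasure-code-profile set ec53 k=5 m=3 ruleset-failure-domain=osd

# Inspect the resulting profile
ceph osd erasure-code-profile get ec53

# Create an erasure coded pool that uses it
ceph osd pool create ecpool 1024 1024 erasure ec53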
Re: [ceph-users] Accessing Ceph from Spark
Hi Milan, We've done some tests here, and our Hadoop can talk to RGW successfully with the SwiftFS plugin, but we haven't tried Spark yet. One thing to note is the data locality feature: it requires some special configuration of the Swift proxy-server, so RGW is not able to achieve data locality there. Could you please kindly share some deployment considerations for running Spark on Swift/Ceph? Tachyon seems more promising...

Sincerely, Yuan

On Wed, Jun 17, 2015 at 9:58 PM, Milan Sladky milan.sla...@outlook.com wrote:
> Is it possible to access Ceph from Spark as it is mentioned here for Openstack Swift? https://spark.apache.org/docs/latest/storage-openstack-swift.html Thanks for the help. Milan Sladky
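For anyone who wants to experiment, the hadoop-openstack (SwiftFS) properties can be passed straight to Spark on the command line. A sketch only - the endpoint, the service name "rgw", and the credentials are all placeholders, and whether hadoop-openstack's auth handshake works against RGW's Swift-compatible API is exactly the open question in this thread:

# Requires the hadoop-openstack jar on the driver/executor classpath
spark-shell \
  --conf spark.hadoop.fs.swift.impl=org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem \
  --conf spark.hadoop.fs.swift.service.rgw.auth.url=http://rgw.example.com:7480/auth/1.0 \
  --conf spark.hadoop.fs.swift.service.rgw.username=testuser:swift \
  --conf spark.hadoop.fs.swift.service.rgw.password=SECRET \
  --conf spark.hadoop.fs.swift.service.rgw.public=true

# Inside the shell, a container is addressed as swift://<container>.<service>/
# e.g.: sc.textFile("swift://mycontainer.rgw/data.txt").count()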
[ceph-users] Erasure Coded Pools and PGs
Hi, I have 5 OSD servers, with a total of 45 OSDs in my cluster. I am trying out erasure coding with different k and m values. I always seem to get warnings about degraded and undersized PGs whenever I create a profile and create a pool based on that profile. I have profiles with k and m value pairs (2,1), (3,3) and (5,3). What would be appropriate PG values? I have tried from as low as 12 up to 1024 and always get the degraded and undersized PGs. This is quite confusing.

Thanks
Pankaj