[ceph-users] Usage pattern and design of Ceph
Hi ceph-users, This is Guang and I am pretty new to Ceph, glad to meet you guys in the community! After walking through some of the Ceph documentation, I have a couple of questions:
1. Is there any comparison between Ceph and AWS S3, in terms of the ability to handle different workloads (from KB to GB), with a corresponding performance report?
2. Looking at some industry solutions for distributed storage, GFS / Haystack / HDFS all use a meta-server that keeps the logical-to-physical mapping in memory, avoiding a disk I/O lookup when reading a file. Is that concern valid for Ceph (in terms of the latency to read a file)?
3. Some industry research shows that one issue for a file system is the metadata-to-data ratio, in terms of both access and storage, and some systems (Haystack, for example) combine small files into large physical files to reduce that ratio. If we want to use Ceph to store photos, should this be a concern, given that Ceph uses one physical file per object?
Thanks, Guang
Re: [ceph-users] Ceph VM Backup
On 08/18/2013 10:58 PM, Wolfgang Hennerbichler wrote: On Sun, Aug 18, 2013 at 06:57:56PM +1000, Martin Rudat wrote: Hi, On 2013-02-25 20:46, Wolfgang Hennerbichler wrote: maybe some of you are interested in this - I'm using a dedicated VM to back up important VMs which have their storage in RBD. This is nothing fancy and not implemented perfectly, but it works. The VMs don't notice that they're backed up; the only requirement is that the filesystem of the VM is directly on the RBD, the script doesn't calculate offsets of partition tables. Looking at how you're doing that, if you trust the script to be able to create new snapshots, couldn't you do it with less machinery involved by installing the ceph binaries on the backup host, creating the snapshot and attaching it with rbd, rather than attaching it to the VM? this was written at a time when kernels could not map format 2 rbd images. Also, where's the fsck call? You're snapshotting a running system; it's almost guaranteed that you've done the snapshot in the middle of a batch of writes; then again, it would be cool to be able to ask the VM to sync, to capture a consistent filesystem. I use journaling filesystems. The journal is replayed during mount (can be seen in kernel logs) and the FS is therefore considered to be clean. I don't know about recent kernels, but older ones could be made to crash by boldly mounting a filesystem that hadn't been fscked. This works for production systems. That's what journals are all about, right? Correct, but older kernels might not respect barriers correctly. If you use a modern kernel (I think 2.6.36 or so) there won't be a problem. Like you said, on mount the journal will be replayed and the FS will be clean. It's nothing less than an unexpected shutdown. Wido wogri -- Martin Rudat -- Wido den Hollander 42on B.V. Phone: +31 (0)20 700 9902 Skype: contact42on
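[Editor's note: for readers wanting to try Martin's suggestion, here is a minimal sketch of running the backup from a host with the ceph binaries installed, assuming a kernel recent enough to map format 2 images; the image, snapshot, and mount-point names are invented:]

# take a point-in-time snapshot of the running VM's image
rbd snap create rbd/vm-disk@nightly
# map the snapshot on the backup host (read-only, if your rbd version supports it)
rbd map rbd/vm-disk@nightly --read-only
# the snapshot was taken on a live FS; mount read-only (ext3/4 may need
# noload, since the journal cannot be replayed on a read-only snapshot)
mount -o ro,noload /dev/rbd0 /mnt/backup
tar -C /mnt/backup -czf /backups/vm-disk.tar.gz .
umount /mnt/backup
rbd unmap /dev/rbd0
rbd snap rm rbd/vm-disk@nightly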
Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling
You're right, PGLog::undirty() looks suspicious. I just pushed a branch wip-dumpling-pglog-undirty with a new config (osd_debug_pg_log_writeout) which, if set to false, will disable some strictly-debugging checks which occur in PGLog::undirty(). We haven't actually seen these checks causing excessive cpu usage, so this may be a red herring. -Sam
On Sat, Aug 17, 2013 at 2:48 PM, Oliver Daudey oli...@xs4all.nl wrote: Hey Mark, On za, 2013-08-17 at 08:16 -0500, Mark Nelson wrote: On 08/17/2013 06:13 AM, Oliver Daudey wrote: Hey all, This is a copy of Bug #6040 (http://tracker.ceph.com/issues/6040) I created in the tracker. Thought I would pass it through the list as well, to get an idea if anyone else is running into it. It may only show under higher loads. More info about my setup is in the bug report above. Here goes:
I'm running a Ceph cluster with 3 nodes, each of which runs a mon, osd and mds. I'm using RBD on this cluster as storage for KVM; CephFS is unused at this time. While still on v0.61.7 Cuttlefish, I got 70-100+ MB/sec on simple linear writes to a file with `dd' inside a VM on this cluster under regular load, and the osds usually averaged 20-100% CPU utilisation in `top'. After the upgrade to Dumpling, CPU usage for the osds shot up to 100% to 400% in `top' (multi-core system) and the speed of my writes with `dd' inside a VM dropped to 20-40 MB/sec. Users complained that disk access inside the VMs was significantly slower, and the backups of the RBD store I was running also got behind quickly. After downgrading only the osds to v0.61.7 Cuttlefish and leaving the rest at 0.67 Dumpling, speed and load returned to normal. I have repeated this performance hit upon upgrade on a similar test cluster under no additional load at all. Although CPU usage for the osds wasn't as dramatic during these tests because there was no base load from other VMs, I/O performance dropped significantly after upgrading during these tests as well, and returned to normal after downgrading the osds. I'm not sure what to make of it. There are no visible errors in the logs and everything runs and reports good health; it's just a lot slower, with a lot more CPU usage.
Hi Oliver, If you have access to the perf command on this system, could you try running: sudo perf top And if that doesn't give you much: sudo perf record -g then: sudo perf report | less during the period of high CPU usage? This will give you a call graph. There may be symbols missing, but it might help track down what the OSDs are doing. Thanks for your help!
I did a couple of runs on my test cluster, loading it with writes from 3 VMs concurrently and measuring the results at the first node, once with all 0.67 Dumpling components and once with the osds replaced by 0.61.7 Cuttlefish. I let `perf top' run and settle for a while, then copied anything that showed in red and green into this post. Here are the results:
First, with 0.61.7 osds:
19.91% [kernel] [k] intel_idle
10.18% [kernel] [k] _raw_spin_lock_irqsave
6.79% ceph-osd [.] ceph_crc32c_le
4.93% [kernel] [k] default_send_IPI_mask_sequence_phys
2.71% [kernel] [k] copy_user_generic_string
1.42% libc-2.11.3.so [.] memcpy
1.23% [kernel] [k] find_busiest_group
1.13% librados.so.2.0.0 [.] ceph_crc32c_le_intel
1.11% [kernel] [k] _raw_spin_lock
0.99% kvm [.] 0x1931f8
0.92% [igb] [k] igb_poll
0.87% [kernel] [k] native_write_cr0
0.80% [kernel] [k] csum_partial
0.78% [kernel] [k] __do_softirq
0.63% [kernel] [k] hpet_legacy_next_event
0.53% [ip_tables] [k] ipt_do_table
0.50% libc-2.11.3.so [.] 0x74433
Second test, with 0.67 osds:
18.32% [kernel] [k] intel_idle
7.58% [kernel] [k] _raw_spin_lock_irqsave
7.04% ceph-osd [.] PGLog::undirty()
4.39% ceph-osd [.] ceph_crc32c_le_intel
3.92% [kernel] [k] default_send_IPI_mask_sequence_phys
2.25% [kernel] [k] copy_user_generic_string
1.76% libc-2.11.3.so [.] memcpy
1.56% librados.so.2.0.0 [.] ceph_crc32c_le_intel
1.40% libc-2.11.3.so [.] vfprintf
1.12% libc-2.11.3.so [.] 0x7217b
1.05% [kernel] [k] _raw_spin_lock
1.01% [kernel] [k] find_busiest_group
0.83% kvm [.] 0x193ab8
0.80% [kernel] [k] native_write_cr0
0.76% [kernel] [k] __do_softirq
0.73%
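[Editor's note: for anyone wanting to test Sam's branch, the option would presumably be set like any other OSD option; the option name below is taken from his mail, but treat the snippet as an untested sketch:]

# ceph.conf, on OSD hosts running the wip-dumpling-pglog-undirty build
[osd]
    osd debug pg log writeout = false

# or injected at runtime, without a restart:
ceph tell osd.* injectargs '--osd_debug_pg_log_writeout=false'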
[ceph-users] Ceph Deployments
Hello, I just have some small questions about Ceph deployment models and whether this would work for us. The first question: is it possible to have a Ceph single-node setup, where everything is on one node? Our application, Ceph's object storage and a database? We focus on this deployment model for our very small customers, who only have around 20 members using our application, so the load wouldn't be very high. The next question: is it possible to extend the Ceph single node to 3 nodes later, if they need more availability? Also, we always want to use shared-nothing machines, so every service would be on one machine. Is this okay for Ceph, or does Ceph really need a lot of CPU/memory/disk speed? We make archiving software for small customers and we want to move things from the file system onto an object store. Currently we only have customers that need 1 machine or 3 machines, but everything should work just as well on more.
Re: [ceph-users] Ceph Deployments
On 08/19/2013 10:36 AM, Schmitt, Christian wrote: Hello, I just have some small questions about Ceph deployment models and whether this would work for us. The first question: is it possible to have a Ceph single-node setup, where everything is on one node? yes. depends on 'everything', but it's possible (though not recommended) to run mon, mds, and osd's on the same host, and even do virtualisation. Our application, Ceph's object storage and a database? what is 'a database'? We focus on this deployment model for our very small customers, who only have around 20 members using our application, so the load wouldn't be very high. The next question: is it possible to extend the Ceph single node to 3 nodes later, if they need more availability? yes. Also, we always want to use shared-nothing machines, so every service would be on one machine. Is this okay for Ceph, or does Ceph really need a lot of CPU/memory/disk speed? ceph needs cpu / disk speed when disks fail and need to be recovered. it also uses some cpu when you have a lot of i/o, but generally it is rather lightweight. shared nothing is possible with ceph, but in the end this really depends on your application. We make archiving software for small customers and we want to move things from the file system onto an object store. you mean from the filesystem to an object storage? Currently we only have customers that need 1 machine or 3 machines, but everything should work just as well on more. it would with ceph. probably :)
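[Editor's note: one practical detail for the single-node case that neither mail spells out, so take it as an assumption, not Wolfgang's advice: the default CRUSH rule places replicas on distinct hosts, so a one-host cluster needs replica placement lowered to the OSD level, roughly:]

# ceph.conf on the single node
[global]
    # choose distinct OSDs instead of distinct hosts when placing replicas
    osd crush chooseleaf type = 0

When the cluster later grows to 3 nodes, this can be reverted to the host-level default so replicas spread across machines.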
Re: [ceph-users] Usage pattern and design of Ceph
On 19/08/13 18:17, Guang Yang wrote: 3. Some industry research shows that one issue for a file system is the metadata-to-data ratio, in terms of both access and storage, and some systems (Haystack, for example) combine small files into large physical files to reduce that ratio. If we want to use Ceph to store photos, should this be a concern, given that Ceph uses one physical file per object? If you use Ceph as a pure object store, and get and put data via the basic rados api, then sure, one client data object will be stored in one Ceph 'object'. However, if you use the rados gateway (S3- or Swift-lookalike api), then each client data object will be broken up into chunks at the rados level (typically 4M-sized chunks). Regards Mark
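[Editor's note: a quick way to see the distinction Mark draws, sketched from memory; pool and object names are invented, and .rgw.buckets is the gateway's default bucket pool in this era but may differ per setup:]

# pure rados: one client object becomes exactly one rados object
rados -p test put photo1 ./photo.jpg
rados -p test ls
photo1

# the same photo uploaded through radosgw (S3/Swift) lands as ~4M chunks,
# i.e. several rados objects in the gateway's bucket pool:
rados -p .rgw.buckets ls | head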
Re: [ceph-users] Usage pattern and design of Ceph
On 08/19/2013 11:18 AM, Mark Kirkwood wrote: However, if you use the rados gateway (S3- or Swift-lookalike api), then each client data object will be broken up into chunks at the rados level (typically 4M-sized chunks). => which is a good thing in terms of replication and OSD usage distribution. Regards Mark -- DI (FH) Wolfgang Hennerbichler Software Development Unit Advanced Computing Technologies RISC Software GmbH A company of the Johannes Kepler University Linz IT-Center Softwarepark 35 4232 Hagenberg Austria Phone: +43 7236 3343 245 Fax: +43 7236 3343 250 wolfgang.hennerbich...@risc-software.at http://www.risc-software.at
Re: [ceph-users] Destroyed Ceph Cluster
Hello List, The troubles to fix such a cluster continue... I get output like this now:
# ceph health
HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; mds cluster is degraded; mds vvx-ceph-m-03 is laggy
When checking for the ceph-mds processes, there are now none left... no matter which server I check. And they won't start up again!? The log starts up with:
2013-08-19 11:23:30.503214 7f7e9dfbd780 0 ceph version 0.67 (e3b7bc5bce8ab330ec1661381072368af3c218a0), process ceph-mds, pid 27636
2013-08-19 11:23:30.523314 7f7e9904b700 1 mds.-1.0 handle_mds_map standby
2013-08-19 11:23:30.529418 7f7e9904b700 1 mds.0.26 handle_mds_map i am now mds.0.26
2013-08-19 11:23:30.529423 7f7e9904b700 1 mds.0.26 handle_mds_map state change up:standby -- up:replay
2013-08-19 11:23:30.529426 7f7e9904b700 1 mds.0.26 replay_start
2013-08-19 11:23:30.529434 7f7e9904b700 1 mds.0.26 recovery set is
2013-08-19 11:23:30.529436 7f7e9904b700 1 mds.0.26 need osdmap epoch 277, have 276
2013-08-19 11:23:30.529438 7f7e9904b700 1 mds.0.26 waiting for osdmap 277 (which blacklists prior instance)
2013-08-19 11:23:30.534090 7f7e9904b700 -1 mds.0.sessionmap _load_finish got (2) No such file or directory
2013-08-19 11:23:30.535483 7f7e9904b700 -1 mds/SessionMap.cc: In function 'void SessionMap::_load_finish(int, ceph::bufferlist)' thread 7f7e9904b700 time 2013-08-19 11:23:30.534107
mds/SessionMap.cc: 83: FAILED assert(0 == failed to load sessionmap)
Anyone an idea how to get the cluster back running? Georg
On 16.08.2013 16:23, Mark Nelson wrote: Hi Georg, I'm not an expert on the monitors, but that's probably where I would start. Take a look at your monitor logs and see if you can get a sense for why one of your monitors is down. Some of the other devs will probably be around later that might know if there are any known issues with recreating the OSDs and missing PGs. Mark
On 08/16/2013 08:21 AM, Georg Höllrigl wrote: Hello, I'm still evaluating ceph - now a test cluster with the 0.67 dumpling. I've created the setup with ceph-deploy from GIT. I've recreated a bunch of OSDs, to give them another journal. There already was some test data on these OSDs. I've already recreated the missing PGs with ceph pg force_create_pg
HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 5 requests are blocked 32 sec; mds cluster is degraded; 1 mons down, quorum 0,1,2 vvx-ceph-m-01,vvx-ceph-m-02,vvx-ceph-m-03
Any idea how to fix the cluster, besides completely rebuilding the cluster from scratch? What if such a thing happens in a production environment... The pgs from ceph pg dump all look like creating for some time now:
2.3d 0 0 0 0 0 0 0 creating 2013-08-16 13:43:08.186537 0'0 0:0 [] [] 0'0 0.00 0'0 0.00
Is there a way to just dump the data that was on the discarded OSDs? Kind Regards, Georg
Re: [ceph-users] Ceph Deployments
On 2013-08-19 18:36, Schmitt, Christian wrote: The first question: is it possible to have a Ceph single-node setup, where everything is on one node? Yes, definitely; I've currently got a single-node ceph 'cluster', but, to the best of my knowledge, it's not the recommended configuration for long-term usage; in the coming weeks (given this is a home server), I'll be attempting to bring up another two nodes. Our application, Ceph's object storage and a database? We focus on this deployment model for our very small customers, who only have around 20 members using our application, so the load wouldn't be very high. The next question: is it possible to extend the Ceph single node to 3 nodes later, if they need more availability? I'm not sure how much ram the monitor and mds take, but each osd (disk) seems to nominally use 300M of ram. My 'server' is a micro-ATX board with 5 spinning disks and an SSD, plugged into a small UPS; total cost about 2000 AUD. It's running a mail server, backuppc for the other VMs, PCs and laptops in the house, a file server re-exporting the disk from ceph, and some other random stuff. The VMs chew up a little more than 8G of ram in total, and on the 16G machine there don't seem to be any performance problems (with only two users, mind you). Also, we always want to use shared-nothing machines, so every service would be on one machine. Is this okay for Ceph, or does Ceph really need a lot of CPU/memory/disk speed? We make archiving software for small customers and we want to move things from the file system onto an object store. Currently we only have customers that need 1 machine or 3 machines, but everything should work just as well on more. Depending on your definition of 'machine', a cluster of 3 smaller machines may be substitutable for a single larger one; with the hope that hardware failure only takes out 1 node, leaving the whole cluster still online and able to be restored to full capacity at your (relative) leisure, rather than Right Now, as the backups aren't running anymore... The two 'new' nodes I'm spinning up are my old desktop machine and its predecessor, which, arguably, could be construed as being 'free'. =) For firms of your target size, it may be an effective thing to suggest upgrading one or more desktops, and use the old machines to run the backup system on. Especially if you're charging for the service provided, more than for the hardware, you may be able to consolidate multiple existing servers into VMs running on a ceph cluster, with enough spare capacity to also run your backup suite, with minimal to no actual hardware outlay. -- Martin Rudat
Re: [ceph-users] Assert and monitor-crash when attempting to create pool-snapshots while rbd-snapshots are in use or have been used on a pool
On 08/18/2013 07:11 PM, Oliver Daudey wrote: Hey all, Also created on the tracker, under http://tracker.ceph.com/issues/6047 While playing around on my test cluster, I ran into a problem that I've seen before, but have never been able to reproduce until now. The use of pool-snapshots and rbd-snapshots seems to be mutually exclusive in the same pool, even if you have used one type of snapshot before and have since deleted all snapshots of that type. Unfortunately, the condition doesn't appear to be handled gracefully yet, leading, in one case, to monitors crashing. I think this one goes back at least as far as Bobtail and still exists in Dumpling. My cluster is a straightforward one with 3 Debian Squeeze nodes, each running a mon, mds and osd. To reproduce:
# ceph osd pool create test 256 256
pool 'test' created
# ceph osd pool mksnap test snapshot
created pool test snap snapshot
# ceph osd pool rmsnap test snapshot
removed pool test snap snapshot
So far, so good. Now we try to create an rbd-snapshot in the same pool:
# rbd --pool=test create --size=102400 image
# rbd --pool=test snap create image@snapshot
rbd: failed to create snapshot: (22) Invalid argument
2013-08-18 19:27:50.892291 7f983bc10780 -1 librbd: failed to create snap id: (22) Invalid argument
That failed, but at least the cluster is OK. Now we start over again and create the rbd-snapshot first:
# ceph osd pool delete test test --yes-i-really-really-mean-it
pool 'test' deleted
# ceph osd pool create test 256 256
pool 'test' created
# rbd --pool=test create --size=102400 image
# rbd --pool=test snap create image@snapshot
# rbd --pool=test snap ls image
SNAPID NAME SIZE
2 snapshot 102400 MB
# rbd --pool=test snap rm image@snapshot
# ceph osd pool mksnap test snapshot
2013-08-18 19:35:59.494551 7f48d75a1700 0 monclient: hunting for new mon
^C
Error EINTR: (I pressed CTRL-C)
Thanks for the steps to reproduce, Oliver! Managed to reproduce this on 0.67.1 on the first attempt. This bug appears to be the same as #5959 on the tracker. I spent some time last week looking into it, and although I realized it was far too easy to trigger it on cuttlefish, I never managed to trigger it on next -- which I attributed to d1501938f5d07c067d908501fc5cfe3c857d7281. I'll be looking into this. -Joao
My leader monitor crashed at that last command; here's the apparent critical point in the logs:
-3 2013-08-18 19:35:59.315956 7f9b870b1700 1 -- 194.109.43.18:6789/0 == client.5856 194.109.43.18:0/1030570 8 mon_command({snap: snapshot, prefix: osd pool mksnap, pool: test} v 0) v1 107+0+0 (983560 0 0) 0x23e4200 con 0x2d202c0
-2 2013-08-18 19:35:59.316020 7f9b870b1700 0 mon.a@0(leader) e1 handle_command mon_command({snap: snapshot, prefix: osd pool mksnap, pool: test} v 0) v1
-1 2013-08-18 19:35:59.316033 7f9b870b1700 1 mon.a@0(leader).paxos(paxos active c 1190049..1190629) is_readable now=2013-08-18 19:35:59.316034 lease_expire=2013-08-18 19:36:03.535809 has v0 lc 1190629
0 2013-08-18 19:35:59.317612 7f9b870b1700 -1 osd/osd_types.cc: In function 'void pg_pool_t::add_snap(const char*, utime_t)' thread 7f9b870b1700 time 2013-08-18 19:35:59.316102
osd/osd_types.cc: 682: FAILED assert(!is_unmanaged_snaps_mode())
Apart from fixing this assert and maybe giving a clearer error message on the failed creation of the rbd-snapshot, maybe there should be a way to switch from one snaps_mode to the other without having to delete the entire pool, if one doesn't already exist. BTW: How exactly does one use the pool-snapshots? There doesn't seem to be a documented way of listing or using them after creation. More info available on request. Regards, Oliver -- Joao Eduardo Luis Software Engineer | http://inktank.com | http://ceph.com
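[Editor's note: on Oliver's side question about using pool snapshots, the rados CLI does expose them, along the lines of the following; untested from memory, object names invented, see rados(8):]

# list snapshots of a pool
rados -p test lssnap
# read an object as it was at snapshot time
rados -p test -s snapshot get someobject /tmp/someobject.old
# roll a single object back to the snapshot
rados -p test rollback someobject snapshot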
Re: [ceph-users] Ceph Deployments
On Mon, 19 Aug 2013 10:50:25 +0200, Wolfgang Hennerbichler wrote: On 08/19/2013 10:36 AM, Schmitt, Christian wrote: Hello, I just have some small questions about Ceph deployment models and whether this would work for us. The first question: is it possible to have a Ceph single-node setup, where everything is on one node? yes. depends on 'everything', but it's possible (though not recommended) to run mon, mds, and osd's on the same host, and even do virtualisation. Currently we don't want to virtualise on this machine, since the machine is really small; as said, we focus on small to midsize businesses. Most of the time they even need a tower server due to the lack of a proper rack. ;/ Our application, Ceph's object storage and a database? what is 'a database'? We run PostgreSQL or MariaDB (without/with Galera, depending on the cluster size). We focus on this deployment model for our very small customers, who only have around 20 members using our application, so the load wouldn't be very high. The next question: is it possible to extend the Ceph single node to 3 nodes later, if they need more availability? yes. That's good! Also, we always want to use shared-nothing machines, so every service would be on one machine. Is this okay for Ceph, or does Ceph really need a lot of CPU/memory/disk speed? ceph needs cpu / disk speed when disks fail and need to be recovered. it also uses some cpu when you have a lot of i/o, but generally it is rather lightweight. shared nothing is possible with ceph, but in the end this really depends on your application. hm, when a disk fails we are already doing backups onto a Dell PowerVault RD1000, so I don't think that's a problem, and we would also run Ceph on a Dell PERC RAID controller with RAID1 enabled on the data disk. We make archiving software for small customers and we want to move things from the file system onto an object store. you mean from the filesystem to an object storage? yes, currently everything is on the filesystem and this is really horrible, thousands of PDFs just on the filesystem. We can't scale up that easily with this setup. Currently we run on Microsoft servers, but we plan to rewrite our whole codebase with scaling in mind, from 1 to X servers. So 1, 3, 5, 7, 9, ... X²-1 should be possible. Currently we only have customers that need 1 machine or 3 machines, but everything should work just as well on more. it would with ceph. probably :) That's nice to hear. I was really scared that we wouldn't find a solution that can run on 1 system and scale up to even more. We first looked at HDFS, but it isn't lightweight, and the overhead of metadata etc. just isn't that cool.
Re: [ceph-users] Ceph Deployments
On 08/19/2013 12:01 PM, Schmitt, Christian wrote: yes. depends on 'everything', but it's possible (though not recommended) to run mon, mds, and osd's on the same host, and even do virtualisation. Currently we don't want to virtualise on this machine, since the machine is really small; as said, we focus on small to midsize businesses. Most of the time they even need a tower server due to the lack of a proper rack. ;/ whoa :) Our application, Ceph's object storage and a database? what is 'a database'? We run PostgreSQL or MariaDB (without/with Galera, depending on the cluster size). You wouldn't want to put the data of postgres or mariadb on cephfs. I would run the native versions directly on the servers and use mysql multi-master circular replication. I don't know about similar features of postgres. shared nothing is possible with ceph, but in the end this really depends on your application. hm, when a disk fails we are already doing backups onto a Dell PowerVault RD1000, so I don't think that's a problem, and we would also run Ceph on a Dell PERC RAID controller with RAID1 enabled on the data disk. this is open to discussion, and really depends on your use case. We make archiving software for small customers and we want to move things from the file system onto an object store. you mean from the filesystem to an object storage? yes, currently everything is on the filesystem and this is really horrible, thousands of PDFs just on the filesystem. We can't scale up that easily with this setup. Got it. Currently we run on Microsoft servers, but we plan to rewrite our whole codebase with scaling in mind, from 1 to X servers. So 1, 3, 5, 7, 9, ... X²-1 should be possible. cool. Currently we only have customers that need 1 machine or 3 machines, but everything should work just as well on more. it would with ceph. probably :) That's nice to hear. I was really scared that we wouldn't find a solution that can run on 1 system and scale up to even more. We first looked at HDFS, but it isn't lightweight. not only that, HDFS also has a single point of failure. And the overhead of metadata etc. just isn't that cool. :)
[ceph-users] Deploy Ceph on RHEL6.4
Hi ceph-users, I would like to check if there is any manual / set of steps which can help me try to deploy Ceph on RHEL? Thanks, Guang
Re: [ceph-users] Ceph Deployments
2013/8/19 Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at: On 08/19/2013 12:01 PM, Schmitt, Christian wrote: yes. depends on 'everything', but it's possible (though not recommended) to run mon, mds, and osd's on the same host, and even do virtualisation. Currently we don't want to virtualise on this machine, since the machine is really small; as said, we focus on small to midsize businesses. Most of the time they even need a tower server due to the lack of a proper rack. ;/ whoa :) Yep, that's awful. Our application, Ceph's object storage and a database? what is 'a database'? We run PostgreSQL or MariaDB (without/with Galera, depending on the cluster size). You wouldn't want to put the data of postgres or mariadb on cephfs. I would run the native versions directly on the servers and use mysql multi-master circular replication. I don't know about similar features of postgres. No, I don't want to put a MariaDB cluster on CephFS. We want to put PDFs in CephFS or Ceph's object storage and hold a key or path in the database; other things like user management will also belong in the database. shared nothing is possible with ceph, but in the end this really depends on your application. hm, when a disk fails we are already doing backups onto a Dell PowerVault RD1000, so I don't think that's a problem, and we would also run Ceph on a Dell PERC RAID controller with RAID1 enabled on the data disk. this is open to discussion, and really depends on your use case. Yeah, we definitely know that it isn't good to use Ceph on a single node, but I think it's easier to design the application so that it depends on Ceph; it wouldn't be easy to manage a single node without Ceph and more than 1 node with Ceph. We make archiving software for small customers and we want to move things from the file system onto an object store. you mean from the filesystem to an object storage? yes, currently everything is on the filesystem and this is really horrible, thousands of PDFs just on the filesystem. We can't scale up that easily with this setup. Got it. Currently we run on Microsoft servers, but we plan to rewrite our whole codebase with scaling in mind, from 1 to X servers. So 1, 3, 5, 7, 9, ... X²-1 should be possible. cool. Currently we only have customers that need 1 machine or 3 machines, but everything should work just as well on more. it would with ceph. probably :) That's nice to hear. I was really scared that we wouldn't find a solution that can run on 1 system and scale up to even more. We first looked at HDFS, but it isn't lightweight. not only that, HDFS also has a single point of failure. And the overhead of metadata etc. just isn't that cool. :) Yeah, that's why I came to Ceph. I think that's probably the way we want to go. Really, thank you for your help. It's good to know that I have a solution for the things that are badly designed in our current solution.
Re: [ceph-users] Deploy Ceph on RHEL6.4
On Mon, Aug 19, 2013 at 6:09 PM, Guang Yang yguan...@yahoo.com wrote: Hi ceph-users, I would like to check if there is any manual / set of steps which can help me try to deploy Ceph on RHEL? Setup with ceph-deploy: http://dachary.org/?p=1971 Official documentation will also be helpful: http://ceph.com/docs/master/start/quick-ceph-deploy/ -- Thanks, xan.peng
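[Editor's note: the quick-start xan.peng links boils down to roughly this ceph-deploy sequence; hostnames and the data disk below are placeholders, and on RHEL 6.4 ceph-deploy pulls in the el6 packages:]

# from the admin node
ceph-deploy new mon1
ceph-deploy install mon1 osd1 osd2
ceph-deploy mon create mon1
ceph-deploy gatherkeys mon1
ceph-deploy osd create osd1:sdb osd2:sdb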
[ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)
Hi, I have an OSD which crashes every time I try to start it (see logs below). Is it a known problem? And is there a way to fix it?
root@taman:/var/log/ceph# grep -v ' pipe' osd.65.log
2013-08-19 11:07:48.478558 7f6fe367a780 0 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff), process ceph-osd, pid 19327
2013-08-19 11:07:48.516363 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and appears to work
2013-08-19 11:07:48.516380 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-08-19 11:07:48.516514 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
2013-08-19 11:07:48.517087 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully supported
2013-08-19 11:07:48.517389 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount found snaps
2013-08-19 11:07:49.199483 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-08-19 11:07:52.191336 7f6fe367a780 1 journal _open /dev/sdk4 fd 18: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.196020 7f6fe367a780 1 journal _open /dev/sdk4 fd 18: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.196920 7f6fe367a780 1 journal close /dev/sdk4
2013-08-19 11:07:52.199908 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and appears to work
2013-08-19 11:07:52.199916 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-08-19 11:07:52.200058 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs
2013-08-19 11:07:52.200886 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully supported
2013-08-19 11:07:52.200919 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount found snaps
2013-08-19 11:07:52.215850 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-08-19 11:07:52.219819 7f6fe367a780 1 journal _open /dev/sdk4 fd 26: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.227420 7f6fe367a780 1 journal _open /dev/sdk4 fd 26: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1
2013-08-19 11:07:52.500342 7f6fe367a780 0 osd.65 144201 crush map has features 262144, adjusting msgr requires for clients
2013-08-19 11:07:52.500353 7f6fe367a780 0 osd.65 144201 crush map has features 262144, adjusting msgr requires for osds
2013-08-19 11:08:13.581709 7f6fbdcb5700 -1 osd/OSD.cc: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f6fbdcb5700 time 2013-08-19 11:08:13.579519
osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl))
ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
1: (OSDService::get_map(unsigned int)+0x44b) [0x6f5b9b]
2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x3c8) [0x6f8f48]
3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x31f) [0x6f975f]
4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x7391d4]
5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8f8e3a]
6: (ThreadPool::WorkThread::entry()+0x10) [0x8fa0e0]
7: (()+0x6b50) [0x7f6fe3070b50]
8: (clone()+0x6d) [0x7f6fe15cba7d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
full logs here: http://pastebin.com/RphNyLU0
[ceph-users] Poor write/random read/random write performance
I have a 3-node, 15-OSD Ceph cluster setup:
* 15 7200 RPM SATA disks, 5 for each node.
* 10G network
* Intel(R) Xeon(R) CPU E5-2620 (6 cores) 2.00GHz for each node.
* 64G RAM for each node.
I deployed the cluster with ceph-deploy, and created a new data pool for CephFS. Both the data and metadata pools are set with replica size 3. Then I mounted the CephFS on one of the three nodes, and tested the performance with fio.
The sequential read performance looks good:
fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
But the sequential write/random read/random write performance is very poor:
fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec
I am mostly surprised by the sequential write performance compared to the raw SATA disk performance (it can get 4127 IOPS when mounted with ext4). My CephFS only gets 1/10 the performance of the raw disk. How can I tune my cluster to improve the sequential write/random read/random write performance?
[ceph-users] dumpling ceph cli tool breaks openstack cinder
Hi, I just noticed that in Dumpling the ceph CLI tool no longer utilises the CEPH_ARGS environment variable. This is used by OpenStack Cinder to specify the cephx user. Ref: http://ceph.com/docs/next/rbd/rbd-openstack/#configure-openstack-to-use-ceph I modified this line in /usr/share/pyshared/cinder/volume/driver.py from:
stdout, _ = self._execute('ceph', 'fsid')
to:
stdout, _ = self._execute('ceph', '--id', 'volumes', 'fsid')
For my particular setup this seems to be sufficient as a quick workaround. Is there a proper way to do this with the new tool? Note: this only hit me when I tried to create a volume from an image (I'm using copy-on-write cloning). Creating a fresh volume didn't invoke the ceph fsid command in the OpenStack script, so I guess some OpenStack users will not be affected. Thanks, Øystein
Re: [ceph-users] Poor write/random read/random write performance
Sorry, forgot to mention the OS and kernel version: it's CentOS 6.4 with kernel 3.10.6, fio 2.0.13. (the original benchmark mail, quoted here in full, is snipped)
Re: [ceph-users] Poor write/random read/random write performance
On 08/19/2013 06:28 AM, Da Chun Ng wrote: I have a 3-node, 15-OSD Ceph cluster setup:
* 15 7200 RPM SATA disks, 5 for each node.
* 10G network
* Intel(R) Xeon(R) CPU E5-2620 (6 cores) 2.00GHz for each node.
* 64G RAM for each node.
I deployed the cluster with ceph-deploy, and created a new data pool for CephFS. Both the data and metadata pools are set with replica size 3. Then I mounted the CephFS on one of the three nodes, and tested the performance with fio. The sequential read performance looks good:
fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec
Sounds like readahead and or caching is helping out a lot here. Btw, you might want to make sure this is actually coming from the disks with iostat or collectl or something.
But the sequential write/random read/random write performance is very poor:
fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec
One thing to keep in mind is that unless you have SSDs in this system, you will be doing 2 writes for every client write to the spinning disks (since data and journals will both be on the same disk). So let's do the math:
6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024 (KB-bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive
If there is no write coalescing going on, this isn't terrible. If there is, this is terrible. Have you tried buffered writes with the sync engine at the same IO size?
fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec
In this case: 11087 * 1024 (KB-bytes) / 16384 / 15 = ~46 IOPS / drive. Definitely not great! You might want to try fiddling with read ahead both on the CephFS client and on the block devices under the OSDs themselves. One thing I did notice back during bobtail is that increasing the number of osd op threads seemed to help small object read performance. It might be worth looking at too. http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread Other than that, if you really want to dig into this, you can use tools like iostat, collectl, blktrace, and seekwatcher to try and get a feel for what the IO going to the OSDs looks like. That can help when diagnosing this sort of thing.
fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec
6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024 (KB-bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS / drive
I am mostly surprised by the sequential write performance compared to the raw SATA disk performance (it can get 4127 IOPS when mounted with ext4). My CephFS only gets 1/10 the performance of the raw disk.
7200 RPM spinning disks typically top out at something like 150 IOPS (and some are lower). With 15 disks, to hit 4127 IOPS you were probably seeing some write coalescing effects (or if these were random reads, some benefit to read ahead).
How can I tune my cluster to improve the sequential write/random read/random write performance?
I don't know what kind of controller you have, but in cases where journals are on the same disks as the data, using writeback cache helps a lot because the controller can coalesce the direct IO journal writes in cache and just do big periodic dumps to the drives. That really reduces seek overhead for the writes. Using SSDs for the journals accomplishes much of the same effect, and lets you get faster large IO writes too, but in many chassis there is a density (and cost) trade-off. Hope this helps! Mark
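[Editor's note: one way to check Mark's per-drive numbers directly, using standard sysstat/collectl tooling rather than anything from his mail: watch the write rate and average request size on the OSD data disks while fio runs; device names below are examples.]

# on an OSD node, during the fio run
iostat -x 1 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
# w/s around 150-165 with avgrq-sz around 32 sectors (16KB) would match the
# "no write coalescing" arithmetic above; a much larger avgrq-sz means the
# journal/data writes are being merged before they hit the platters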
[ceph-users] Request: preinstalled Virtual Machine Images for cloning.
Dear Ceph Developers and Users, I was wondering if there is any download location for preinstalled virtual machine images with the latest release of Ceph. Preferably 4 different images with Ceph-OSD, Ceph-Mon, Ceph-MDS and, last but not least, a Ceph client with an iSCSI target server installed. But since the latter is the client, I guess any distro would do. If this doesn't exist, maybe it's a great idea for distribution from the ceph.com website. I could just start up an image like ceph-osd on any hypervisor to add its local storage via disk passthrough to my Ceph private cloud, and just distribute some monitors and metadata servers over the rest of the hypervisors. Packages like this can be kept small (for example using SliTaz, since that one performs best on Hyper-V hypervisors). Any ideas? Regards, Johannes
Re: [ceph-users] Poor write/random read/random write performance
Thanks very much, Mark! Yes, I put the data and journal on the same disk; no SSD in my environment. My controllers are plain SATA II. Some more questions inline below (they were marked in blue in the original mail).
On Mon, 19 Aug 2013, Mark Nelson wrote: (full quote trimmed to the passages answered below)
Sounds like readahead and or caching is helping out a lot here. Btw, you might want to make sure this is actually coming from the disks with iostat or collectl or something.
I ran sync; echo 3 | tee /proc/sys/vm/drop_caches on all the nodes before every test. I used collectl to watch every disk IO; the numbers should match. I think readahead is helping here.
If there is no write coalescing going on, this isn't terrible. If there is, this is terrible.
How can I know if there is write coalescing going on?
Have you tried buffered writes with the sync engine at the same IO size?
Do you mean as below?
fio -direct=0 -iodepth 1 -thread -rw=write -ioengine=sync -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60
You might want to try fiddling with read ahead both on the CephFS client and on the block devices under the OSDs themselves.
Could you please tell me how to enable read ahead on the CephFS client? For the block devices under the OSDs, the read ahead value is:
[root@ceph0 ~]# blockdev --getra /dev/sdi
256
How big is appropriate for it?
(remainder of the quoted mail snipped)
Re: [ceph-users] dumpling ceph cli tool breaks openstack cinder
On Mon, 19 Aug 2013, Sébastien Han wrote: Hi, The new version of the driver (for Havana) doesn't need the CEPH_ARGS argument; the driver now uses librbd and librados (not the CLI anymore). I guess a better patch will result in:
stdout, _ = self._execute('ceph', '--id', self.configuration.rbd_user, 'fsid')
I'll report the bug. Thanks! However I don't know how to fix this with the new CLI. I opened http://tracker.ceph.com/issues/6052. This is a simple matter of adding a call to rados_conf_parse_env(...). Thanks! sage
Cheers. Sébastien Han Cloud Engineer Always give 100%. Unless you're giving blood. Phone: +33 (0)1 49 70 99 72 - Mobile: +33 (0)6 52 84 44 70 Mail: sebastien@enovance.com - Skype : han.sbastien Address : 10, rue de la Victoire - 75009 Paris Web : www.enovance.com - Twitter : @enovance
On August 19, 2013 at 1:28:57 PM, Øystein Lønning Nerhus (ner...@vx.no) wrote: Hi, I just noticed that in Dumpling the ceph CLI tool no longer utilises the CEPH_ARGS environment variable. This is used by OpenStack Cinder to specify the cephx user. Ref: http://ceph.com/docs/next/rbd/rbd-openstack/#configure-openstack-to-use-ceph I modified this line in /usr/share/pyshared/cinder/volume/driver.py from:
stdout, _ = self._execute('ceph', 'fsid')
to:
stdout, _ = self._execute('ceph', '--id', 'volumes', 'fsid')
For my particular setup this seems to be sufficient as a quick workaround. Is there a proper way to do this with the new tool? Note: this only hit me when I tried to create a volume from an image (I'm using copy-on-write cloning). Creating a fresh volume didn't invoke the ceph fsid command in the OpenStack script, so I guess some OpenStack users will not be affected. Thanks, Øystein
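[Editor's note: for the curious, the environment-variable handling Sage refers to is one librados call; a minimal sketch in Python of how a client picks up CEPH_ARGS, as an illustration of the mechanism rather than the actual patch:]

import rados

# honour CEPH_ARGS, e.g. CEPH_ARGS="--id volumes", the way the old CLI did
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.conf_parse_env()   # with no argument, parses the CEPH_ARGS variable
cluster.connect()
print(cluster.get_fsid())
cluster.shutdown()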
Re: [ceph-users] Destroyed Ceph Cluster
Have you ever used the FS? It's missing an object which we're intermittently seeing failures to create (on initial setup) when the cluster is unstable. If so, clear out the metadata pool and check the docs for newfs. -Greg
On Monday, August 19, 2013, Georg Höllrigl wrote: (full quote of the earlier mail in this thread snipped)
-- Software Engineer #42 @ http://inktank.com | http://ceph.com
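[Editor's note: Greg's pointer spelled out, as a reconstruction for 0.67 rather than a command from his mail; newfs destroys all CephFS metadata, so only do this if the FS contents are expendable, and check the docs and built-in help before running anything:]

# wipe the old metadata objects, then rebuild the FS tables
rados purge metadata --yes-i-really-really-mean-it
# recreate the filesystem over the existing pools
# (numeric pool IDs come from 'ceph osd dump')
ceph mds newfs <metadata-pool-id> <data-pool-id> --yes-i-really-mean-it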
Re: [ceph-users] Poor write/random read/random write performance
On 08/19/2013 08:59 AM, Da Chun Ng wrote: Thanks very much! Mark. Yes, I put the data and journal on the same disk, no SSD in my environment. My controllers are general SATA II. Ok, so in this case the lack of WB cache on the controller and no SSDs for journals is probably having an effect. Some more questions below in blue. Date: Mon, 19 Aug 2013 07:48:23 -0500 From: mark.nel...@inktank.com To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Poor write/random read/random write performance On 08/19/2013 06:28 AM, Da Chun Ng wrote: I have a 3 nodes, 15 osds ceph cluster setup: * 15 7200 RPM SATA disks, 5 for each node. * 10G network * Intel(R) Xeon(R) CPU E5-2620(6 cores) 2.00GHz, for each node. * 64G Ram for each node. I deployed the cluster with ceph-deploy, and created a new data pool for cephfs. Both the data and metadata pools are set with replica size 3. Then mounted the cephfs on one of the three nodes, and tested the performance with fio. The sequential read performance looks good: fio -direct=1 -iodepth 1 -thread -rw=read -ioengine=libaio -bs=16K -size=1G -numjobs=16 -group_reporting -name=mytest -runtime 60 read : io=10630MB, bw=181389KB/s, iops=11336 , runt= 60012msec Sounds like readahead and or caching is helping out a lot here. Btw, you might want to make sure this is actually coming from the disks with iostat or collectl or something. I ran sync echo 3 | tee /proc/sys/vm/drop_caches on all the nodes before every test. I used collectl to watch every disk IO, the numbers should match. I think readahead is helping here. Ok, good! I suspect that readahead is indeed helping. But the sequential write/random read/random write performance is very poor: fio -direct=1 -iodepth 1 -thread -rw=write -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60 write: io=397280KB, bw=6618.2KB/s, iops=413 , runt= 60029msec One thing to keep in mind is that unless you have SSDs in this system, you will be doing 2 writes for every client write to the spinning disks (since data and journals will both be on the same disk). So let's do the math: 6618.2KB/s * 3 replication * 2 (journal + data writes) * 1024 (KB-bytes) / 16384 (write size in bytes) / 15 drives = ~165 IOPS / drive If there is no write coalescing going on, this isn't terrible. If there is, this is terrible. How can I know if there is write coalescing going on? look in collectl at the average IO sizes going to the disks. I bet they will be 16KB. If you were to look further with blktrace and seekwatcher, I bet you'd see lots of seeking between OSD data writes and journal writes since there is no controller cache helping smooth things out (and your journals are on the same drives). Have you tried buffered writes with the sync engine at the same IO size? Do you mean as below? fio -direct=0-iodepth 1 -thread -rw=write -ioengine=sync-bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60 Yeah, that'd work. fio -direct=1 -iodepth 1 -thread -rw=randread -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60 read : io=665664KB, bw=11087KB/s, iops=692 , runt= 60041msec In this case: 11087 * 1024 (KB-bytes) / 16384 / 15 = ~46 IOPS / drive. Definitely not great! You might want to try fiddling with read ahead both on the CephFS client and on the block devices under the OSDs themselves. Could you please tell me how to enable read ahead on the CephFS client? 
It's one of the mount options: http://ceph.com/docs/master/man/8/mount.ceph/ For the block devices under the OSDs, the read ahead value is: [root@ceph0 ~]# blockdev --getra /dev/sdi 256 How big is appropriate for it? To be honest I've seen different results depending on the hardware. I'd try anywhere from 32kb to 2048kb. One thing I did notice back during bobtail is that increasing the number of osd op threads seemed to help small object read performance. It might be worth looking at too. http://ceph.com/community/ceph-bobtail-jbod-performance-tuning/#4kbradosread Other than that, if you really want to dig into this, you can use tools like iostat, collectl, blktrace, and seekwatcher to try and get a feel for what the IO going to the OSDs looks like. That can help when diagnosing this sort of thing. fio -direct=1 -iodepth 1 -thread -rw=randwrite -ioengine=libaio -bs=16K -size=256M -numjobs=16 -group_reporting -name=mytest -runtime 60 write: io=361056KB, bw=6001.1KB/s, iops=375 , runt= 60157msec 6001.1KB/s * 3 replication * 2 (journal + data writes) * 1024 (KB-bytes) / 16384 (write size in bytes) / 15 drives = ~150 IOPS / drive I am mostly surprised by the seq write
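To make the readahead advice above concrete, here is a minimal sketch of the knobs involved. The device name, readahead value, and monitor address are placeholders, and the rsize option shown is taken from the mount.ceph man page linked above, so verify it against your own kernel before relying on it:

  # readahead on a block device under an OSD; the value is in 512-byte
  # sectors, so 2048 means 1MB (versus the default of 256 = 128KB)
  blockdev --setra 2048 /dev/sdi
  blockdev --getra /dev/sdi

  # CephFS kernel client read size is set at mount time, e.g.
  # (option name per the mount.ceph man page; verify locally):
  mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs -o name=admin,rsize=524288

  # sanity-check the per-drive randread arithmetic quoted above
  echo "11087 * 1024 / 16384 / 15" | bc -l    # ~46 IOPS per drive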
Re: [ceph-users] Poor write/random read/random write performance
Thank you! Testing now. How about pg num? I'm using the default size 64, as I tried with (100 * osd_num)/replica_size, but it decreased the performance surprisingly. Date: Mon, 19 Aug 2013 11:33:30 -0500 From: mark.nel...@inktank.com To: dachun...@outlook.com CC: ceph-users@lists.ceph.com Subject: Re: [ceph-users] Poor write/random read/random write performance [earlier exchange, quoted in full above, trimmed]
Re: [ceph-users] Poor write/random read/random write performance
On 08/19/2013 12:05 PM, Da Chun Ng wrote: Thank you! Testing now. How about pg num? I'm using the default size 64, as I tried with (100 * osd_num)/replica_size, but it decreased the performance surprisingly. Oh! That's odd! Typically you would want more than that. Most likely you aren't distributing PGs very evenly across OSDs with 64. More PGs shouldn't decrease performance unless the monitors are behaving badly. We saw some issues back in early cuttlefish but you should be fine with many more PGs. Mark [earlier exchange, quoted in full above, trimmed]
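For reference, the rule of thumb mentioned above, (100 * osd_num) / replica_size, works out to about 500 for this 15-OSD, 3-replica cluster; rounding up to a power of 2 gives 512. A sketch of raising pg_num on an existing pool follows; the pool name is a placeholder, pg splitting requires Cuttlefish or later, and pgp_num should be raised to match:

  echo "100 * 15 / 3" | bc          # 500 -> round up to a power of 2, e.g. 512
  ceph osd pool set data pg_num 512
  ceph osd pool set data pgp_num 512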
Re: [ceph-users] Ceph Deployments
Actually, I wrote the Quick Start guides so that you could do exactly what you are trying to do, but mostly from a kick-the-tires perspective so that people can learn to use Ceph without imposing $100k worth of hardware as a requirement. See http://ceph.com/docs/master/start/quick-ceph-deploy/ I even added a section so that you could do it on one disk--e.g., on your laptop. http://ceph.com/docs/master/start/quick-ceph-deploy/#multiple-osds-on-the-os-disk-demo-only It says demo only, because you won't get great performance out of a single node. Monitors, OSDs, and Journals writing to disk and fsync issues would make performance sub-optimal. For better performance, you should consider a separate drive for each Ceph OSD Daemon if you can, and potentially a separate SSD drive partitioned for journals. If you can separate the OS and monitor drives from the OSD drives, that's better too. I wrote it as a two-node quick start, because you cannot kernel-mount the Ceph Filesystem or Ceph Block Devices on the same host as the Ceph Storage Cluster. It's a kernel issue, not a Ceph issue. However, you can get around this too. If your machine has enough RAM and CPU, you can also install virtual machines and kernel-mount cephfs and block devices in the virtual machines with no kernel issues. You don't need to use VMs at all for librbd. So you can install QEMU/KVM, libvirt and OpenStack all on the same host too. It's just not an ideal situation from a performance or high-availability perspective. On Mon, Aug 19, 2013 at 3:12 AM, Schmitt, Christian c.schm...@briefdomain.de wrote: 2013/8/19 Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at: On 08/19/2013 12:01 PM, Schmitt, Christian wrote: yes. depends on 'everything', but it's possible (though not recommended) to run mon, mds, and osds on the same host, and even do virtualisation. Currently we don't want to virtualise on this machine since the machine is really small; as said, we focus on small to midsize businesses. Most of the time they even need a tower server due to the lack of a proper rack. ;/ whoa :) Yep, that's awful. Our application, Ceph's object storage and a database? what is 'a database'? We run PostgreSQL or MariaDB (without/with Galera depending on the cluster size). You wouldn't want to put the data of postgres or mariadb on cephfs. I would run the native versions directly on the servers and use mysql multi-master circular replication. I don't know about similar features of postgres. No, I don't want to put a MariaDB cluster on CephFS; we want to put PDFs in CephFS or Ceph's object storage and hold a key or path in the database. Other things like user management will also belong to the database. shared nothing is possible with ceph, but in the end this really depends on your application. hm, when a disk fails we're already doing backups on a Dell PowerVault RD1000, so I don't think that's a problem, and we would also run ceph on a Dell PERC RAID controller with RAID1 enabled on the data disk. this is open to discussion, and really depends on your use case. Yeah, we definitely know that it isn't good to use Ceph on a single node, but I think it's easier to design the application so that it depends on ceph. It wouldn't be easy to maintain a single node without ceph and more than 1 node with ceph. Currently we make archiving software for small customers and we want to move things on the file system on a object storage. you mean from the filesystem to an object storage?
yes, currently everything is on the filesystem and this is really horrible, thousands of PDFs just on the filesystem. We can't scale up that easily with this setup. Got it. Currently we run on Microsoft servers, but we plan to rewrite our whole codebase with scaling in mind, from 1 to X servers. So 1, 3, 5, 7, 9, ... X²-1 should be possible. cool. Currently we only have customers that need 1 machine or 3 machines. But everything should work just as well on more. it would with ceph. probably :) That's nice to hear. I was really scared that we wouldn't find a solution that can run on 1 system and scale up to even more. We first looked at HDFS but this isn't lightweight. not only that, HDFS also has a single point of failure. And the overhead of metadata etc. just isn't that cool. :) Yeah, that's why I came to Ceph. I think that's probably the way we want to go. Really, thank you for your help. It's good to know that I have a solution for the things that are badly designed in our current solution. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- John Wilkins Senior Technical Writer Inktank john.wilk...@inktank.com (415) 425-9599
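For anyone following John's quick-start route on a single box, the flow looks roughly like the sketch below. Hostnames and disk paths are placeholders, ceph-deploy subcommands have varied between versions, and the chooseleaf setting is only needed so replicas can peer across OSDs on one host, so treat this as an outline rather than a recipe:

  ceph-deploy new node1                    # seed ceph.conf and the initial monmap
  # for a single host, allow replicas on different OSDs of the same host:
  #   add "osd crush chooseleaf type = 0" under [global] in ceph.conf
  ceph-deploy install node1
  ceph-deploy mon create node1
  ceph-deploy gatherkeys node1
  ceph-deploy osd prepare node1:/dev/sdb
  ceph-deploy osd activate node1:/dev/sdb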
Re: [ceph-users] RBD and balanced reads
On Mon, Aug 19, 2013 at 9:07 AM, Sage Weil s...@inktank.com wrote: On Mon, 19 Aug 2013, Sébastien Han wrote: Hi guys, While reading a developer doc, I came across the following options: * osd balance reads = true * osd shed reads = true * osd shed reads min latency * osd shed reads min latency diff The problem is that I can't find any of these options in config_opts.h. These are left over from an old unimplemented experiment and were removed a while back. Loic Dachary also gave me a flag that he found in the code. m->get_flags() & CEPH_OSD_FLAG_LOCALIZE_READS So my questions are: * Which from the above flags are correct? * Do balanced reads really exist in RBD? For localized reads you want OPTION(rbd_balance_snap_reads, OPT_BOOL, false) OPTION(rbd_localize_snap_reads, OPT_BOOL, false) Note that the 'localize' logic is still very primitive (it matches by IP address). There is a blueprint to improve this: http://wiki.ceph.com/01Planning/02Blueprints/Emperor/librados%2F%2Fobjecter%3A_smarter_localized_reads Also, there are some issues with read/write consistency when using localized reads because the replicas do not provide the ordering guarantees that primaries will. See http://tracker.ceph.com/issues/5388 At present localized reads are really only suitable for spreading the load on write-once, read-many workloads. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
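Going by the OPTION() lines quoted above, turning this on for a librbd client is a client-side config entry; the following is a sketch only, and given the read/write consistency caveats mentioned it should be limited to write-once, read-many images:

  # on the client, enable localized reads of snapshots (sketch):
  cat >> /etc/ceph/ceph.conf <<'EOF'
  [client]
      rbd localize snap reads = true
  EOF
  # or, to spread snap reads across replicas instead of preferring
  # local ones, use: rbd balance snap reads = true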
Re: [ceph-users] Ceph Deployments
What you are trying to do will work, because you will not need any kernel-related code for object storage, so a one-node setup will work for you. -- Sent from my mobile device On 19.08.2013, at 20:29, Schmitt, Christian c.schm...@briefdomain.de wrote: That sounds bad for me. As said, one of the things we consider is a one-node setup, for production. Not every customer will afford hardware worth more than ~4000 Euro. Small business users don't need the biggest hardware, but I don't think it's a good way to have one version that uses the filesystem and one version that uses ceph. We prefer an object storage for our files. It should work like the object storage of the App Engine. That scales from 1 to X servers. 2013/8/19 John Wilkins john.wilk...@inktank.com: [John's quick-start reply and the earlier exchange, quoted in full above, trimmed]
Re: [ceph-users] Ceph Deployments
Wolfgang is correct. You do not need VMs at all if you are setting up Ceph Object Storage. It's just Apache, FastCGI, and the radosgw daemon interacting with the Ceph Storage Cluster. You can do that on one box, no problem. It's still better to have more drives for performance, though. On Mon, Aug 19, 2013 at 12:08 PM, Wolfgang Hennerbichler wolfgang.hennerbich...@risc-software.at wrote: What you are trying to do will work, because you will not need any kernel-related code for object storage, so a one-node setup will work for you. -- Sent from my mobile device On 19.08.2013, at 20:29, Schmitt, Christian c.schm...@briefdomain.de wrote: [Christian's reply and the rest of the thread, quoted in full above, trimmed]
Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling
Hey Samuel, Thanks! I installed your version, repeated the same tests on my test-cluster and the extra CPU-loading seems to have disappeared. Then I replaced one osd of my production-cluster with your modified version and its config option and it seems to be a lot less CPU-hungry now. Although the Cuttlefish-osds still seem to be even more CPU-efficient, your changes have definitely helped a lot. We seem to be looking in the right direction, at least for this part of the problem. BTW, I ran `perf top' on the production-node with your modified osd and didn't see anything osd-related stand out on top. PGLog::undirty() was in there, but with much lower usage, right at the bottom of the green part of the output. Many thanks for your help so far! Regards, Oliver On ma, 2013-08-19 at 00:29 -0700, Samuel Just wrote: [Sam's message and the earlier thread history, quoted in full earlier, trimmed]
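For anyone wanting to repeat Oliver's test with Sam's wip-dumpling-pglog-undirty build, the new option translates to an [osd] config entry, or can be injected into a running OSD (injectargs syntax may vary by version). The option name is taken from Sam's message, and it only has an effect on that branch:

  cat >> /etc/ceph/ceph.conf <<'EOF'
  [osd]
      osd debug pg log writeout = false
  EOF
  # or on a running osd, without restarting (osd.0 as an example):
  ceph tell osd.0 injectargs '--osd_debug_pg_log_writeout=false'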
Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling
Hi Oliver, Glad that helped! How much more efficient do the cuttlefish OSDs seem at this point (with wip-dumpling-pglog-undirty)? On modern Intel platforms we were actually hoping to see CPU usage go down in many cases due to the use of hardware CRC32 instructions. Mark On 08/19/2013 03:06 PM, Oliver Daudey wrote: [Oliver's message and the earlier thread history, quoted in full above, trimmed]
Re: [ceph-users] osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) (ceph 0.61.7)
On Monday 19 August 2013 at 12:27 +0200, Olivier Bonvalet wrote: Hi, I have an OSD which crashes every time I try to start it (see logs below). Is it a known problem? And is there a way to fix it? root@taman:/var/log/ceph# grep -v ' pipe' osd.65.log 2013-08-19 11:07:48.478558 7f6fe367a780 0 ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff), process ceph-osd, pid 19327 2013-08-19 11:07:48.516363 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and appears to work 2013-08-19 11:07:48.516380 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option 2013-08-19 11:07:48.516514 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs 2013-08-19 11:07:48.517087 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully supported 2013-08-19 11:07:48.517389 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount found snaps 2013-08-19 11:07:49.199483 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal mode: btrfs not detected 2013-08-19 11:07:52.191336 7f6fe367a780 1 journal _open /dev/sdk4 fd 18: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1 2013-08-19 11:07:52.196020 7f6fe367a780 1 journal _open /dev/sdk4 fd 18: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1 2013-08-19 11:07:52.196920 7f6fe367a780 1 journal close /dev/sdk4 2013-08-19 11:07:52.199908 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is supported and appears to work 2013-08-19 11:07:52.199916 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option 2013-08-19 11:07:52.200058 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount did NOT detect btrfs 2013-08-19 11:07:52.200886 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount syscall(SYS_syncfs, fd) fully supported 2013-08-19 11:07:52.200919 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount found snaps 2013-08-19 11:07:52.215850 7f6fe367a780 0 filestore(/var/lib/ceph/osd/ceph-65) mount: enabling WRITEAHEAD journal mode: btrfs not detected 2013-08-19 11:07:52.219819 7f6fe367a780 1 journal _open /dev/sdk4 fd 26: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1 2013-08-19 11:07:52.227420 7f6fe367a780 1 journal _open /dev/sdk4 fd 26: 53687091200 bytes, block size 4096 bytes, directio = 1, aio = 1 2013-08-19 11:07:52.500342 7f6fe367a780 0 osd.65 144201 crush map has features 262144, adjusting msgr requires for clients 2013-08-19 11:07:52.500353 7f6fe367a780 0 osd.65 144201 crush map has features 262144, adjusting msgr requires for osds 2013-08-19 11:08:13.581709 7f6fbdcb5700 -1 osd/OSD.cc: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7f6fbdcb5700 time 2013-08-19 11:08:13.579519 osd/OSD.cc: 4844: FAILED assert(_get_map_bl(epoch, bl)) ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff) 1: (OSDService::get_map(unsigned int)+0x44b) [0x6f5b9b] 2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x3c8) [0x6f8f48] 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x31f) [0x6f975f] 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x7391d4] 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8f8e3a]
6: (ThreadPool::WorkThread::entry()+0x10) [0x8fa0e0] 7: (()+0x6b50) [0x7f6fe3070b50] 8: (clone()+0x6d) [0x7f6fe15cba7d] NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this. full logs here: http://pastebin.com/RphNyLU0 Hi, still same problem with Ceph 0.61.8 : 2013-08-19 23:01:54.369609 7fdd667a4780 0 osd.65 144279 crush map has features 262144, adjusting msgr requires for osds 2013-08-19 23:01:58.315115 7fdd405de700 -1 osd/OSD.cc: In function 'OSDMapRef OSDService::get_map(epoch_t)' thread 7fdd405de700 time 2013-08-19 23:01:58.313955 osd/OSD.cc: 4847: FAILED assert(_get_map_bl(epoch, bl)) ceph version 0.61.8 (a6fdcca3bddbc9f177e4e2bf0d9cdd85006b028b) 1: (OSDService::get_map(unsigned int)+0x44b) [0x6f736b] 2: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, PG::RecoveryCtx*, std::set<boost::intrusive_ptr<PG>, std::less<boost::intrusive_ptr<PG> >, std::allocator<boost::intrusive_ptr<PG> > >*)+0x3c8) [0x6fa708] 3: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x31f) [0x6faf1f] 4: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x14) [0x73a9b4] 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68a) [0x8fb69a] 6:
Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling
Hey Mark, If I look at the wip-dumpling-pglog-undirty-version with regular top, I see a slightly higher base-load on the osd, with significantly more and higher spikes in it than the Dumpling-osds. Looking with `perf top', PGLog::undirty() is still there, although pulling significantly less CPU. With the Cuttlefish-osds, I don't see it at all, even under load. That may account for the extra load I'm still seeing, but I don't know what is still going on in it and if that too can be safely disabled to save some more CPU. All in all, it's quite close and seems a bit difficult to measure. I'd say the CPU-usage with wip-dumpling-pglog-undirty is still a good 30% higher than Cuttlefish on my production-cluster. I have yet to upgrade all osds and compare performance of the cluster as a whole. Is the wip-dumpling-pglog-undirty-version considered safe enough to do so? If you have any tips for other safe benchmarks, I'll try those as well. Thanks! Regards, Oliver On ma, 2013-08-19 at 15:21 -0500, Mark Nelson wrote: [Mark's reply and the earlier thread history, quoted in full above, trimmed]
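On the benchmark question: rados bench is a fairly low-risk way to compare the two osd builds at the RADOS level, since it only touches a pool created for the purpose. A sketch, with the pool name as a placeholder and assuming your rados version supports --no-cleanup (needed so the read pass has objects to read back):

  rados mkpool benchpool
  rados bench -p benchpool 60 write --no-cleanup   # 60s write test, 16 concurrent ops by default
  rados bench -p benchpool 60 seq                  # sequential read-back of the same objects
  rados rmpool benchpool benchpool --yes-i-really-really-mean-it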
Re: [ceph-users] large memory leak on scrubbing
Hi, Is that the only slow request message you see? No. Full log: https://www.dropbox.com/s/i3ep5dcimndwvj1/slow_requests.txt.tar.gz It starts from: 2013-08-16 09:43:39.662878 mon.0 10.174.81.132:6788/0 4276384 : [DBG] osd.4 10.174.81.131:6805/31460 reported failed by osd.50 10.174.81.135:6842/26019 2013-08-16 09:43:40.711911 mon.0 10.174.81.132:6788/0 4276386 : [DBG] osd.4 10.174.81.131:6805/31460 reported failed by osd.14 10.174.81.132:6836/2958 2013-08-16 09:43:41.043016 mon.0 10.174.81.132:6788/0 4276388 : [DBG] osd.4 10.174.81.131:6805/31460 reported failed by osd.13 10.174.81.132:6830/2482 2013-08-16 09:43:41.043047 mon.0 10.174.81.132:6788/0 4276389 : [INF] osd.4 10.174.81.131:6805/31460 failed (3 reports from 3 peers after 2013-08-16 09:43:56.042983 >= grace 20.00) 2013-08-16 09:43:41.122326 mon.0 10.174.81.132:6788/0 4276390 : [INF] osdmap e10294: 144 osds: 143 up, 143 in 2013-08-16 09:43:38.798833 osd.4 10.174.81.131:6805/31460 913 : [WRN] 6 slow requests, 6 included below; oldest blocked for 30.190146 secs 2013-08-16 09:43:38.798843 osd.4 10.174.81.131:6805/31460 914 : [WRN] slow request 30.190146 seconds old, received at 2013-08-16 09:43:08.585504: osd_op(client.22301645.0:48987 .dir.1585245.1 [call rgw.bucket_complete_op] 16.33d5ea80) v4 currently waiting for subops from [25,133] 2013-08-16 09:43:38.798854 osd.4 10.174.81.131:6805/31460 915 : [WRN] slow request 30.189643 seconds old, received at 2013-08-16 09:43:08.586007: osd_op(client.22301855.0:49374 .dir.1585245.1 [call rgw.bucket_complete_op] 16.33d5ea80) v4 currently waiting for subops from [25,133] 2013-08-16 09:43:38.798859 osd.4 10.174.81.131:6805/31460 916 : [WRN] slow request 30.188236 seconds old, received at 2013-08-16 09:43:08.587414: osd_op(client.22307596.0:47674 .dir.1585245.1 [call rgw.bucket_complete_op] 16.33d5ea80) v4 currently waiting for subops from [25,133] 2013-08-16 09:43:38.798862 osd.4 10.174.81.131:6805/31460 917 : [WRN] slow request 30.187853 seconds old, received at 2013-08-16 09:43:08.587797: osd_op(client.22303894.0:51846 .dir.1585245.1 [call rgw.bucket_complete_op] 16.33d5ea80) v4 currently waiting for subops from [25,133] ... 2013-08-16 09:44:18.126318 mon.0 10.174.81.132:6788/0 4276427 : [INF] osd.4 10.174.81.131:6805/31460 boot ...
2013-08-16 09:44:23.215918 mon.0 10.174.81.132:6788/0 4276437 : [DBG] osd.25 10.174.81.133:6810/2961 reported failed by osd.83 10.174.81.137:6837/27963 2013-08-16 09:44:23.704769 mon.0 10.174.81.132:6788/0 4276438 : [INF] pgmap v17035051: 32424 pgs: 1 stale+active+clean+scrubbing+deep, 2 active, 31965 active+clean, 7 stale+active+clean, 29 peering, 415 active+degraded, 5 active+clean+scrubbing; 6630 GB data, 21420 GB used, 371 TB / 392 TB avail; 246065/61089697 degraded (0.403%) 2013-08-16 09:44:23.711244 mon.0 10.174.81.132:6788/0 4276439 : [DBG] osd.133 10.174.81.142:6803/21366 reported failed by osd.26 10.174.81.133:6814/3674 2013-08-16 09:44:23.713597 mon.0 10.174.81.132:6788/0 4276440 : [DBG] osd.133 10.174.81.142:6803/21366 reported failed by osd.17 10.174.81.132:6806/9188 2013-08-16 09:44:23.753952 mon.0 10.174.81.132:6788/0 4276441 : [DBG] osd.133 10.174.81.142:6803/21366 reported failed by osd.27 10.174.81.133:6822/5389 2013-08-16 09:44:23.753982 mon.0 10.174.81.132:6788/0 4276442 : [INF] osd.133 10.174.81.142:6803/21366 failed (3 reports from 3 peers after 2013-08-16 09:44:38.753913 >= grace 20.00) 2013-08-16 09:47:10.229099 mon.0 10.174.81.132:6788/0 4276646 : [INF] pgmap v17035216: 32424 pgs: 32424 active+clean; 6630 GB data, 21420 GB used, 371 TB / 392 TB avail; 0B/s rd, 622KB/s wr, 85op/s Why are osds 'reported failed' during scrubbing? -- Regards Dominik ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Flapping osd / continuously reported as failed
Hi, Yes, it definitely can as scrubbing takes locks on the PG, which will prevent reads or writes while the message is being processed (which will involve the rgw index being scanned). Is it possible to tune the scrubbing config to eliminate slow requests and avoid marking osds down when a large rgw bucket index is being scrubbed? -- Regards Dominik ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Flapping osd / continuously reported as failed
On Mon, Aug 19, 2013 at 3:09 PM, Mostowiec Dominik dominik.mostow...@grupaonet.pl wrote: Hi, Yes, it definitely can as scrubbing takes locks on the PG, which will prevent reads or writes while the message is being processed (which will involve the rgw index being scanned). Is it possible to tune the scrubbing config to eliminate slow requests and avoid marking osds down when a large rgw bucket index is being scrubbed? Unfortunately not, or we would have mentioned it before. :/ There are some proposals for sharding bucket indexes that would ameliorate this problem, and on Cuttlefish or Dumpling the OSD won't get marked down, but it will still block incoming requests on that object (ie, requests to access the bucket) while the scrubbing is in place. That said, that improvement might be sufficient since you haven't actually shown us how long the object scrub takes. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
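To put a number on how long that index scrub actually takes, the cluster log records per-PG deep-scrub start and completion at default log levels, and ceph osd map will show which PG holds the index object. A sketch; the pool name, object name, and PG id below are examples taken from the log excerpt above and should be replaced with your own:

  # which PG (and OSDs) hold the bucket index object?
  ceph osd map .rgw.buckets .dir.1585245.1
  # then compare the timestamps of that PG's deep-scrub lines:
  grep 'deep-scrub' /var/log/ceph/ceph.log | grep '16\.'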
Re: [ceph-users] v0.61.8 Cuttlefish released
On Mon, 19 Aug 2013, James Harper wrote: We've made another point release for Cuttlefish. This release contains a number of fixes that are generally not individually critical, but do trip up users from time to time, are non-intrusive, and have held up under testing. Notable changes include: * librados: fix async aio completion wakeup * librados: fix aio completion locking * librados: fix rare deadlock during shutdown Could any of these be causing the segfaults I'm seeing in tapdisk rbd? Are these fixes in dumpling? They are also in the dumpling branch and 0.67.1. They might explain it... not a slam dunk though. sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Usage pattern and design of Ceph
Thanks Mark. What are the design considerations behind breaking large files into 4M chunks rather than storing the large file directly? Thanks, Guang From: Mark Kirkwood mark.kirkw...@catalyst.net.nz To: Guang Yang yguan...@yahoo.com Cc: ceph-users@lists.ceph.com ceph-users@lists.ceph.com Sent: Monday, August 19, 2013 5:18 PM Subject: Re: [ceph-users] Usage pattern and design of Ceph On 19/08/13 18:17, Guang Yang wrote: 3. Some industry research shows that one issue of file system is the metadata-to-data ratio, in terms of both access and storage, and some technic uses the mechanism to combine small files to large physical files to reduce the ratio (Haystack for example), if we want to use ceph to store photos, should this be a concern as Ceph use one physical file per object? If you use Ceph as a pure object store, and get and put data via the basic rados API, then sure, one client data object will be stored in one Ceph 'object'. However, if you use the rados gateway (S3 or Swift look-alike API) then each client data object will be broken up into chunks at the rados level (typically 4M sized chunks). Regards Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Usage pattern and design of Ceph
Thanks Greg. Some comments inline... On Sunday, August 18, 2013, Guang Yang wrote: Hi ceph-users, This is Guang and I am pretty new to ceph, glad to meet you guys in the community! After walking through some documents of Ceph, I have a couple of questions: 1. Is there any comparison between Ceph and AWS S3, in terms of the ability to handle different work-loads (from KB to GB), with corresponding performance report? Not really; any comparison would be highly biased depending on your Amazon ping and your Ceph cluster. We've got some internal benchmarks where Ceph looks good, but they're not anything we'd feel comfortable publishing. [Guang] Yeah, I mean solely the server-side time, regardless of the RTT impact on the comparison. 2. Looking at some industry solutions for distributed storage, GFS / Haystack / HDFS all use meta-server to store the logical-to-physical mapping within memory and avoid disk I/O lookup for file reading, is the concern valid for Ceph (in terms of latency to read file)? These are very different systems. Thanks to CRUSH, RADOS doesn't need to do any IO to find object locations; CephFS only does IO if the inode you request has fallen out of the MDS cache (not terribly likely in general). This shouldn't be an issue... [Guang] Regarding "CephFS only does IO if the inode you request has fallen out of the MDS cache": my understanding is that if we use CephFS, we will need to interact with RADOS twice, the first time to retrieve metadata (file attributes, owner, etc.) and the second time to load data, and both times will need disk I/O, for the inode and for the data respectively. Is my understanding correct? The approach some other storage systems take is to cache the file handle in memory, so that the I/O to read the inode can be avoided. 3. Some industry research shows that one issue of file system is the metadata-to-data ratio, in terms of both access and storage, and some technic uses the mechanism to combine small files to large physical files to reduce the ratio (Haystack for example), if we want to use ceph to store photos, should this be a concern as Ceph use one physical file per object? ...although this might be. The issue basically comes down to how many disk seeks are required to retrieve an item, and one way to reduce that number is to hack the filesystem by keeping a small number of very large files and calculating (or caching) where different objects are inside that file. Since Ceph is designed for MB-sized objects it doesn't go to these lengths to optimize that path like Haystack might (I'm not familiar with Haystack in particular). That said, you need some pretty extreme latency requirements before this becomes an issue and if you're also looking at HDFS or S3 I can't imagine you're in that ballpark. You should be fine. :) [Guang] Yep, that makes a lot of sense. -Greg -- Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Usage pattern and design of Ceph
On 20/08/13 13:27, Guang Yang wrote: Thanks Mark. What are the design considerations behind breaking large files into 4M chunks rather than storing the large file directly? Quoting Wolfgang from a previous reply: "...which is a good thing in terms of replication and OSD usage distribution" ...which covers what I would have said quite well :-) Cheers Mark ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
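As a back-of-envelope illustration of what the 4M chunking means for distribution: a single large upload through radosgw fans out into many independently placed RADOS objects, so replication and rebalancing operate on 4M units rather than on the whole file. For example:

  # rados objects behind a 1GB object uploaded via radosgw, at 4MB chunks
  echo "1024 / 4" | bc      # 256 chunks, each placed independently by CRUSH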
Re: [ceph-users] rbd map issues: no such file or directory (ENOENT) AND map wrong image
Transferring this back to ceph-users. Sorry, I can't help with rbd issues. One thing I will say is that if you are mounting an rbd device with a filesystem on a machine to export ftp, you can't also export the same device via iSCSI. David Zafman Senior Developer http://www.inktank.com On Aug 19, 2013, at 8:39 PM, PJ linalin1...@gmail.com wrote: 2013/8/14 David Zafman david.zaf...@inktank.com On Aug 12, 2013, at 7:41 PM, Josh Durgin josh.dur...@inktank.com wrote: On 08/12/2013 07:18 PM, PJ wrote: If the target rbd device only maps on one virtual machine, format it as ext4 and mount it to two places: mount /dev/rbd0 /nfs -- for nfs server usage mount /dev/rbd0 /ftp -- for ftp server usage nfs and ftp servers run on the same virtual machine. Will the file system (ext4) help to handle the simultaneous access from nfs and ftp? I doubt that'll work perfectly on a normal disk, although rbd should behave the same in this case. There are going to be some issues when the same files are modified at once by the ftp and nfs servers. You could run ftp on an nfs client on a different machine safely. Modern Linux kernels will do a bind mount when a block device is mounted on 2 different directories. Think directory hard links. Simultaneous access will NOT corrupt ext4, but as Josh said, modifying the same file at once by ftp and nfs isn't going to produce good results. With file locking, 2 nfs clients could coordinate using advisory locking. David Zafman Senior Developer http://www.inktank.com The first issue is reproduced, but there are changes to the system configuration. Due to hardware shortage, we only have one physical machine with one OSD installed, and it runs 6 virtual machines. There is only one monitor (wistor-003) and one FTP server (wistor-004); the other virtual machines are iSCSI servers. The log size is big because when we enable the FTP service for an rbd device, we have an rbd map retry loop in case it fails (retry rbd map every 10 sec, for up to 3 minutes). Please download the monitor log from the link below: https://www.dropbox.com/s/88cb9q91cjszuug/ceph-mon.wistor-003.log.zip Here are the operation steps: 1. The pool rex is created Around 2013-08-20 09:16:38~09:16:39 2. The first time to map the rbd device on wistor-004, it fails (all retries failed) Around 2013-08-20 09:17:43~09:20:46 (180 sec) 3. Tried a second time and it works, but still had 9 failures in the retry loop Around 2013-08-20 09:20:48~09:22:10 (82 sec) ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
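For completeness, the retry loop PJ describes (an rbd map attempt every 10 seconds, for up to 3 minutes) would look something like the sketch below; the pool and image names are placeholders:

  #!/bin/sh
  # try 'rbd map' up to 18 times, 10s apart (~3 minutes total)
  for i in $(seq 1 18); do
      if rbd map rex/myimage; then
          echo "mapped on attempt $i"
          exit 0
      fi
      sleep 10
  done
  echo "rbd map failed after 3 minutes" >&2
  exit 1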