Improving responsiveness of KVM guests on Ceph storage
Hi guys,

I'm testing Ceph as storage for KVM virtual machine images and found an inconvenience that I am hoping it is possible to find the cause of.

I'm running a single KVM Linux guest on top of Ceph storage. In that guest I run rsync to download files from the internet. When rsync is running, the guest will seemingly stall and run very slowly. For example, if I log in via SSH to the guest and use the command prompt, nothing will happen for a long period (30+ seconds), then it processes a few typed characters, then it blocks for another long period, then processes a bit more, and so on.

I was hoping to be able to tweak the system so that it runs more like it would on conventional storage - i.e. perhaps the rsync won't be super fast, but the machine will be equally responsive all the time. I'm hoping that you can provide some hints on how to best benchmark or test the system to find the cause of this?

The Ceph OSDs periodically log these two messages, which I do not fully understand:

2012-12-30 17:07:12.894920 7fc8f3242700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30
2012-12-30 17:07:13.599126 7fc8cbfff700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30

Is this to be expected when the system is in use, or does it indicate that something is wrong? Ceph also logs messages such as this:

2012-12-30 17:07:36.932272 osd.0 10.0.0.1:6800/9157 286340 : [WRN] slow request 30.751940 seconds old, received at 2012-12-30 17:07:06.180236: osd_op(client.4705.0:16074961 rb.0.11b7.4a933baa.000c188f [write 532480~4096] 0.f2a63fe) v4 currently waiting for sub ops

My setup: 3 servers running Fedora 17 with Ceph 0.55.1 from RPM. Each server runs one osd and one mon. One of the servers also runs an mds. The backing file system is btrfs stored on an md-raid. The journal is stored on the same SATA disks as the rest of the data. Each server has 3 bonded gigabit/sec NICs. One server running Fedora 16 with qemu-kvm has a gigabit/sec NIC connected to the same network as the Ceph servers, and a gigabit/sec NIC connected to the Internet.

The disk is mounted with:

-drive format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio

iostat on the KVM guest gives:

avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0,00   0,00    0,00  100,00   0,00   0,00

Device: rrqm/s wrqm/s  r/s  w/s rsec/s wsec/s avgrq-sz avgqu-sz   await   svctm %util
vda       0,00   1,40 0,10 0,30   0,80  13,60    36,00     1,66 2679,25 2499,75 99,99

Top on the KVM host shows 90% CPU idle and 0.0% I/O waiting. iostat on an OSD gives:

avg-cpu:  %user  %nice %system %iowait %steal  %idle
           0,13   0,00    1,50   15,79   0,00  82,58

Device: rrqm/s wrqm/s   r/s    w/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm %util
sda     240,70 441,20 33,00  42,70 1122,40 1961,80    81,48    14,45 164,42  319,14   44,85  6,63 50,22
sdb     299,10 393,10 33,90  38,40 1363,60 1720,60    85,32    13,55 171,32  316,21   43,41  6,55 47,39
sdc     268,50 441,60 28,80  45,40 1191,60 1977,00    85,41    19,08 159,39  345,98   41,02  6,56 48,69
sdd     255,50 445,50 30,20  45,00 1150,40 1975,80    83,14    18,18 155,97  338,90   33,20  6,95 52,23
md0       0,00   0,00  1,20 132,70    4,80 4086,40    61,11     0,00   0,00    0,00    0,00  0,00  0,00

The figures are similar on all three OSDs. I am thinking that one possible cause could be that the journal is stored on the same disks as the rest of the data, but I don't know how to benchmark whether this is actually the case (?)

Thanks for any help or advice you can offer!

--
Jens Kristian Søgaard, Mermaid Consulting ApS, j...@mermaidconsulting.dk, http://www.mermaidconsulting.com/
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
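A quick way to separate cluster-level throughput from guest behaviour is `rados bench`, which writes objects directly to a pool and bypasses the VM entirely. A minimal sketch, assuming a pool named `data` and a node with the Ceph client tools installed (pool name and duration are illustrative, not from the thread):

```shell
# Benchmark raw RADOS write throughput against the "data" pool.
# Watch iostat on the OSD hosts at the same time to see whether the
# journal and data writes are contending for the same spindles.
POOL=data
DURATION=30
if command -v rados >/dev/null 2>&1; then
    rados bench -p "$POOL" "$DURATION" write
else
    echo "rados binary not found; run this on a Ceph client node"
fi
```

If raw bench throughput looks healthy while the guest still stalls, the bottleneck is more likely in the VM I/O path than in the OSDs.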
Re: automatic repair of inconsistent pg?
This is somewhat more likely to have been a bug in the replication logic (there were a few fixed between 0.53 and 0.55). Had there been any recent osd failures?
-Sam

On Mon, Dec 24, 2012 at 10:54 PM, Sage Weil s...@inktank.com wrote:

On Tue, 25 Dec 2012, Stefan Priebe wrote:

Hello list,

today I got the following ceph status output:

2012-12-25 02:57:00.632945 mon.0 [INF] pgmap v1394388: 7632 pgs: 7631 active+clean, 1 active+clean+inconsistent; 151 GB data, 307 GB used, 5028 GB / 5336 GB avail

I then grepped the inconsistent pg by:

# ceph pg dump - | grep inconsistent
3.ccf 10 0 0 0 41037824 155930 155930 active+clean+inconsistent 2012-12-25 01:51:35.318459 6243'2107 6190'9847 [14,42] [14,42] 6243'2107 2012-12-25 01:51:35.318436 6007'2074 2012-12-23 01:51:24.386366

and initiated a repair:

# ceph pg repair 3.ccf
instructing pg 3.ccf on osd.14 to repair

The log output then was:

2012-12-25 02:56:59.056382 osd.14 [ERR] 3.ccf osd.42 missing 1c602ccf/rbd_data.4904d6b8b4567.0b84/head//3
2012-12-25 02:56:59.056385 osd.14 [ERR] 3.ccf osd.42 missing ceb55ccf/rbd_data.48cc66b8b4567.1538/head//3
2012-12-25 02:56:59.097989 osd.14 [ERR] 3.ccf osd.42 missing dba6bccf/rbd_data.4797d6b8b4567.15ad/head//3
2012-12-25 02:56:59.097991 osd.14 [ERR] 3.ccf osd.42 missing a4deccf/rbd_data.45f956b8b4567.03d5/head//3
2012-12-25 02:56:59.098022 osd.14 [ERR] 3.ccf repair 4 missing, 0 inconsistent objects
2012-12-25 02:56:59.098046 osd.14 [ERR] 3.ccf repair 4 errors, 4 fixed

Why doesn't ceph repair this automatically? How could this happen at all?

We just made some fixes to repair in next (it was broken sometime between ~0.53 and 0.55). The latest next should repair it. In general we don't repair automatically lest we inadvertently propagate bad data or paper over a bug.

As for the original source of the missing objects... I'm not sure. There were some fixed races related to backfill that could lead to an object being missed, but Sam would know more about how likely that actually is.
sage
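The inspect-and-repair cycle discussed above can be sketched as a small script. This assumes a running cluster with the `ceph` CLI on the path; the awk pattern simply matches the state string shown in the thread, and the first column of `ceph pg dump` output is taken to be the pgid:

```shell
# Find a PG marked inconsistent and ask its primary OSD to repair it.
# Repair stays a deliberate, manual step, per the discussion above.
PGID=""
if command -v ceph >/dev/null 2>&1; then
    PGID=$(ceph pg dump 2>/dev/null | awk '/inconsistent/ { print $1; exit }')
    if [ -n "$PGID" ]; then
        ceph pg repair "$PGID"
    else
        echo "no inconsistent pgs found"
    fi
else
    echo "ceph binary not found; run this on a cluster node"
fi
```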
Re: Striped images and cluster misbehavior
Sorry for the delay. A quick look at the log doesn't show anything obvious... Can you elaborate on how you caused the hang?
-Sam

On Wed, Dec 19, 2012 at 3:53 AM, Andrey Korolyov and...@xdel.ru wrote:

Please take a look at the log below; this is a slightly different bug - both osd processes on the node were stuck eating all available cpu until I killed them. This can be reproduced by doing parallel exports from the same client IP using either ``rbd export'' or API calls - after a couple of wrong ``downs'', osd.19 and osd.27 finally got stuck. What is more interesting, 10.5.0.33 holds the hungriest set of virtual machines, constantly eating four of twenty-four HT cores, and this node fails almost always. The underlying fs is XFS, ceph version gf9d090e. Quite possibly my previous reports are about side effects of this problem.

http://xdel.ru/downloads/ceph-log/osd-19_and_27_stuck.log.gz

and timings for the monmap; the logs are from different hosts, so they may have a time shift of tens of milliseconds:

http://xdel.ru/downloads/ceph-log/timings-crash-osd_19_and_27.txt

Thanks!
Re: automatic repair of inconsistent pg?
Am 30.12.2012 19:17, schrieb Samuel Just:

This is somewhat more likely to have been a bug in the replication logic (there were a few fixed between 0.53 and 0.55). Had there been any recent osd failures?

Yes, I was stressing Ceph with failures (power, link, disk, ...).

Stefan
Re: Striped images and cluster misbehavior
On Sun, Dec 30, 2012 at 10:56 PM, Samuel Just sam.j...@inktank.com wrote:

Sorry for the delay. A quick look at the log doesn't show anything obvious... Can you elaborate on how you caused the hang?
-Sam

I am sorry for all this noise - the issue was almost certainly triggered by a bug in the Infiniband switch firmware, because a per-port reset was able to solve the ``wrong mark'' problem; at least, it hasn't shown up for a week. The problem took almost two days to resolve - all the connectivity tests displayed no overtimes or drops that could cause wrong marks. Finally, I started playing with TCP settings and found that enabling net.ipv4.tcp_low_latency raised the probability of a ``wrong mark'' event several times - so the set of possible causes quickly collapsed to a media-only problem, and I fixed it soon after.
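The tcp_low_latency correlation described above suggests a simple A/B test. A sketch of toggling the knob around a test run (needs root; the /proc path is the standard Linux location for this sysctl):

```shell
# Toggle net.ipv4.tcp_low_latency around a test run to see whether it
# changes the rate of wrong "down" marks: reproduce the workload and
# watch "ceph -w" between the two writes.
KNOB=/proc/sys/net/ipv4/tcp_low_latency
if [ -w "$KNOB" ]; then
    orig=$(cat "$KNOB")
    echo 1 > "$KNOB"        # enable low-latency mode
    # ... run the rbd export workload here ...
    echo "$orig" > "$KNOB"  # restore the previous setting
else
    echo "cannot write $KNOB (need root on Linux); skipping"
fi
```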
ceph for small cluster?
Hi Folks,

I'm wondering how Ceph would work in a small cluster that supports a mix of engineering and modest production (email, lists, web servers for several small communities). Specifically, we have a rack with 4 medium-horsepower servers, each with 4 disk drives, running Xen (debian dom0 and domUs) - all linked together w/ 4 gigE ethernets. Currently, 2 of the servers run a high-availability configuration, using DRBD to mirror specific volumes and Pacemaker for failover.

For a while, I've been looking for a way to replace DRBD with something that would mirror across more than 2 servers - so that we could migrate VMs arbitrarily - and that will work without splitting up compute vs. storage nodes (for the short term, at least, we're stuck with rack space and server limitations). The thing that looks closest to filling the bill is Sheepdog (at least architecturally) - but it only provides a KVM interface. GlusterFS, XtreemFS, and Ceph keep coming up as possibilities - with Ceph's rbd interface looking like the easiest to integrate. Which leads me to two questions:

- On a theoretical level, does using Ceph as a storage pool for this kind of small cluster make any sense? (Notably, I'd see running an OSD, an MDS, a MON, and client DomUs on each of the 4 nodes, using LVM to pool all the storage - and it seems like folks recommend XFS as a production filesystem.)

- On a practical level, has anybody tried building this kind of small cluster, and if so, what kind of results have you had?

Comments and suggestions please!

Thank you very much,

Miles Fidelman

--
In theory, there is no difference between theory and practice. In practice, there is. Yogi Berra
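For concreteness, a colocated 4-node layout like the one described above might start from a ceph.conf along these lines. This is a sketch only: the host names, addresses, and paths are hypothetical, and it uses the [type.id] section style of Ceph releases of this vintage:

```ini
; Hypothetical 4-node colocated layout: one osd per host, three mons
; (an odd count avoids election ties), one mds on the first host.
; Hosts node1..node4 and the 10.0.0.x addresses are placeholders.
[global]
    auth supported = cephx

[mon.a]
    host = node1
    mon addr = 10.0.0.1:6789

[mon.b]
    host = node2
    mon addr = 10.0.0.2:6789

[mon.c]
    host = node3
    mon addr = 10.0.0.3:6789

[mds.a]
    host = node1

[osd.0]
    host = node1
    osd data = /var/lib/ceph/osd/ceph-0

[osd.1]
    host = node2
    osd data = /var/lib/ceph/osd/ceph-1

[osd.2]
    host = node3
    osd data = /var/lib/ceph/osd/ceph-2

[osd.3]
    host = node4
    osd data = /var/lib/ceph/osd/ceph-3
```

Running three mons rather than four is deliberate: with an even monitor count, losing half the mons stalls the quorum, so the fourth mon would add risk without adding availability.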
Re: Improving responsiveness of KVM guests on Ceph storage
On Sun, Dec 30, 2012 at 9:05 PM, Jens Kristian Søgaard j...@mermaidconsulting.dk wrote:

I'm testing Ceph as storage for KVM virtual machine images and found an inconvenience that I am hoping it is possible to find the cause of. I'm running a single KVM Linux guest on top of Ceph storage. In that guest I run rsync to download files from the internet. When rsync is running, the guest will seemingly stall and run very slowly. ... I'm hoping that you can provide some hints on how to best benchmark or test the system to find the cause of this?

Hi Jens,

You may try playing with SCHED_RT. I have found it hard to use myself, but you can achieve your goal by adding small RT slices via the ``cpu'' cgroup to the vcpu/emulator threads; it dramatically increases overall VM responsiveness. I have thrown it off because the RT scheduler is a very strange thing - it may cause an endless lockup on a disk operation during heavy load, or produce an ever-stuck ``kworker'' on some cores if you have killed a VM which has separate RT slices for its vcpu threads.

Of course, some Ceph tuning like a writeback cache and a large journal may help you too; I'm speaking primarily of VM performance by itself.
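Andrey's cgroup suggestion, as I read it, works out to something like the following sketch. The cgroup mount point, the per-VM cgroup path, and the budget values are all assumptions, and this is the cgroup-v1 interface used by kernels of this era:

```shell
# Give a VM's vcpu cgroup a bounded SCHED_RT budget: within each 1 s
# period the group may run at most 0.95 s as a real-time task, leaving
# slack so RT threads cannot starve the rest of the host completely.
CG=/sys/fs/cgroup/cpu/machine/vm1/vcpu0   # hypothetical cgroup path
PERIOD_US=1000000                         # cpu.rt_period_us
RUNTIME_US=950000                         # cpu.rt_runtime_us
if [ -d "$CG" ]; then
    echo "$PERIOD_US"  > "$CG/cpu.rt_period_us"
    echo "$RUNTIME_US" > "$CG/cpu.rt_runtime_us"
    # Switch the first task in the group to SCHED_RR at priority 1.
    chrt -r -p 1 "$(head -n1 "$CG/tasks")"
else
    echo "cgroup $CG not found; adjust the path to your VM's cgroup"
fi
```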
Re: Improving responsiveness of KVM guests on Ceph storage
Hi Andrey,

Thanks for your reply!

You may try playing with SCHED_RT ... adding small RT slices via ``cpu'' cgroup to vcpu/emulator threads, it dramatically increases overall VM responsiveness.

I'm not quite sure I understand your suggestion. Do you mean that you set the process priority to real-time on each qemu-kvm process, and then use the cgroups cpu.rt_runtime_us / cpu.rt_period_us to restrict the amount of CPU time those processes can receive?

I'm not sure how that would apply here, as I have only one qemu-kvm process, and it is not unresponsive because of a lack of allocated CPU time slices - but rather because some I/Os take a long time to complete, and other I/Os apparently have to wait for those to complete.

Of course, some Ceph tuning like writeback cache and large journal may help you too.

I have been considering the journal as somewhere I could improve performance by tweaking the setup. I have set aside 10 GB of space for the journal, but I'm not sure if this is too little - or if the size really doesn't matter that much when it is on the same mdraid as the data itself.

Is there a tool that can tell me how much of my journal space is actually actively being used? I.e., I'm looking for something that could tell me whether increasing the size of the journal or placing it on a separate (SSD) disk could solve my problem.

How do I change the size of the writeback cache when using qemu-kvm like I do? Does setting rbd cache size in ceph.conf have any effect on qemu-kvm, where the drive is defined as:

format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio

--
Jens Kristian Søgaard, Mermaid Consulting ApS, j...@mermaidconsulting.dk, http://www.mermaidconsulting.com/
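One way to approach the last question, as I understand the rbd: file-name syntax of qemu of this vintage: librbd configuration options can be appended colon-separated inside the file= string, so the cache size can travel with the drive definition rather than relying on ceph.conf. A sketch (the 64 MB value is illustrative, not a recommendation):

```shell
# Build a qemu -drive argument that enables the rbd writeback cache
# and sets its size inline (rbd_cache_size is in bytes; 64 MB here).
POOL=data
IMAGE=image1
CACHE_BYTES=67108864
DRIVE="format=rbd,file=rbd:${POOL}/${IMAGE}:rbd_cache=1:rbd_cache_size=${CACHE_BYTES},if=virtio"
echo "$DRIVE"
# prints: format=rbd,file=rbd:data/image1:rbd_cache=1:rbd_cache_size=67108864,if=virtio
```

The same key=value pairs could instead live in the [client] section of ceph.conf; putting them in the drive string just keeps per-VM settings with the VM definition.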