Improving responsiveness of KVM guests on Ceph storage

2012-12-30 Thread Jens Kristian Søgaard

Hi guys,

I'm testing Ceph as storage for KVM virtual machine images and have run 
into a problem that I am hoping it is possible to find the cause of.


I'm running a single KVM Linux guest on top of Ceph storage. In that 
guest I run rsync to download files from the internet. When rsync is 
running, the guest will seemingly stall and run very slowly.


For example, if I log in via SSH to the guest and use the command prompt, 
nothing will happen for a long period (30+ seconds), then it processes a 
few typed characters, then it blocks for another long period of time, 
then processes a bit more, etc.


I was hoping to be able to tweak the system so that it runs more like 
when using conventional storage - i.e. perhaps the rsync won't be super 
fast, but the machine will be equally responsive all the time.


I'm hoping that you can provide some hints on how best to benchmark or 
test the system to find the cause of this.


The Ceph OSDs periodically log these two messages, which I do not fully 
understand:


2012-12-30 17:07:12.894920 7fc8f3242700  1 heartbeat_map is_healthy 
'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30
2012-12-30 17:07:13.599126 7fc8cbfff700  1 heartbeat_map reset_timeout 
'OSD::op_tp thread 0x7fc8cbfff700' had timed out after 30


Is this to be expected when the system is in use, or does it indicate 
that something is wrong?
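
My working assumption - please correct me if it is wrong - is that these 
lines come from the OSD's internal heartbeat of its op worker thread pool, 
i.e. the op_tp thread spent more than the 30 second grace on a single piece 
of work. If one only wanted to raise that warning threshold (which of course 
would not fix the underlying latency), I would expect something like this in 
ceph.conf; the option name is my guess from the config reference:

[osd]
    ; assumed to be the grace the heartbeat_map message refers to (default 30)
    osd op thread timeout = 60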


Ceph also logs messages such as this:

2012-12-30 17:07:36.932272 osd.0 10.0.0.1:6800/9157 286340 : [WRN] slow 
request 30.751940 seconds old, received at 2012-12-30 17:07:06.180236: 
osd_op(client.4705.0:16074961 rb.0.11b7.4a933baa.000c188f [write 
532480~4096] 0.f2a63fe) v4 currently waiting for sub ops
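
Is the right way to dig into such a stuck request something like the 
following? I am assuming here that this build already has the 
dump_ops_in_flight admin socket command (on older builds it may be missing):

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight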



My setup:

3 servers running Fedora 17 with Ceph 0.55.1 from RPM.
Each server runs one osd and one mon. One of the servers also runs an mds.
Backing file system is btrfs stored on an md-raid. The journal is stored on 
the same SATA disks as the rest of the data.

Each server has 3 bonded gigabit/sec NICs.

One server running Fedora 16 with qemu-kvm.
It has a gigabit/sec NIC connected to the same network as the Ceph servers, 
and a gigabit/sec NIC connected to the Internet.

The disk is attached to the guest with:

-drive format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio


iostat on the KVM guest gives:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,00    0,00    0,00  100,00    0,00    0,00

Device:  rrqm/s  wrqm/s   r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz    await    svctm  %util
vda        0,00    1,40  0,10  0,30    0,80   13,60    36,00     1,66  2679,25  2499,75  99,99



Top on the KVM host shows 90% CPU idle and 0.0% I/O waiting.

iostat on an OSD gives:
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0,13    0,00    1,50   15,79    0,00   82,58

Device:  rrqm/s  wrqm/s    r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda      240,70  441,20  33,00   42,70  1122,40  1961,80    81,48    14,45  164,42  319,14   44,85   6,63  50,22
sdb      299,10  393,10  33,90   38,40  1363,60  1720,60    85,32    13,55  171,32  316,21   43,41   6,55  47,39
sdc      268,50  441,60  28,80   45,40  1191,60  1977,00    85,41    19,08  159,39  345,98   41,02   6,56  48,69
sdd      255,50  445,50  30,20   45,00  1150,40  1975,80    83,14    18,18  155,97  338,90   33,20   6,95  52,23
md0        0,00    0,00   1,20  132,70     4,80  4086,40    61,11     0,00    0,00    0,00    0,00   0,00   0,00



The figures are similar on all three OSDs.

I am thinking that one possible cause could be that the journal is 
stored on the same disks as the rest of the data, but I don't know how 
to benchmark whether this is actually the case.
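
If the journal does turn out to be the bottleneck, my plan - based on my 
reading of the docs, so the exact steps may well be off - would be roughly 
the following, with /dev/sde1 standing in for a hypothetical SSD partition:

# per-OSD section in ceph.conf
[osd.0]
    osd journal = /dev/sde1
    osd journal size = 10240    ; in MB, ignored for a raw partition

# then, with the osd stopped:
service ceph stop osd.0
ceph-osd -i 0 --flush-journal
ceph-osd -i 0 --mkjournal
service ceph start osd.0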


Thanks for any help or advice you can offer!

--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/


Re: automatic repair of inconsistent pg?

2012-12-30 Thread Samuel Just
This is somewhat more likely to have been a bug in the replication
logic (there were a few fixed between 0.53 and 0.55).  Had there been
any recent osd failures?
-Sam

On Mon, Dec 24, 2012 at 10:54 PM, Sage Weil s...@inktank.com wrote:
 On Tue, 25 Dec 2012, Stefan Priebe wrote:
 Hello list,

 today I got the following ceph status output:
 2012-12-25 02:57:00.632945 mon.0 [INF] pgmap v1394388: 7632 pgs: 7631
 active+clean, 1 active+clean+inconsistent; 151 GB data, 307 GB used,
 5028 GB / 5336 GB avail


 I then grepped the inconsistent pg with:
 # ceph pg dump - | grep inconsistent
 3.ccf   10   0   0   0   41037824   155930   155930   active+clean+inconsistent
 2012-12-25 01:51:35.318459   6243'2107   6190'9847   [14,42]   [14,42]
 6243'2107   2012-12-25 01:51:35.318436   6007'2074   2012-12-23 01:51:24.386366

 and initiated a repair:
 #  ceph pg repair 3.ccf
 instructing pg 3.ccf on osd.14 to repair

 The log output then was:
 2012-12-25 02:56:59.056382 osd.14 [ERR] 3.ccf osd.42 missing
 1c602ccf/rbd_data.4904d6b8b4567.0b84/head//3
 2012-12-25 02:56:59.056385 osd.14 [ERR] 3.ccf osd.42 missing
 ceb55ccf/rbd_data.48cc66b8b4567.1538/head//3
 2012-12-25 02:56:59.097989 osd.14 [ERR] 3.ccf osd.42 missing
 dba6bccf/rbd_data.4797d6b8b4567.15ad/head//3
 2012-12-25 02:56:59.097991 osd.14 [ERR] 3.ccf osd.42 missing
 a4deccf/rbd_data.45f956b8b4567.03d5/head//3
 2012-12-25 02:56:59.098022 osd.14 [ERR] 3.ccf repair 4 missing, 0 inconsistent objects
 2012-12-25 02:56:59.098046 osd.14 [ERR] 3.ccf repair 4 errors, 4 fixed

 Why doesn't ceph repair this automatically? How could this happen at all?

 We just made some fixes to repair in next (it was broken sometime between
 ~0.53 and 0.55).  The latest next should repair it.  In general we don't
 repair automatically lest we inadvertently propagate bad data or paper
 over a bug.

 As for the original source of the missing objects... I'm not sure.  There
 were some fixed races related to backfill that could lead to an object
 being missed, but Sam would know more about how likely that actually is.

 sage
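
For reference, the manual repair loop used here is, in sketch (assuming 
'ceph health detail' is available in this version for locating the pg):

ceph health detail       # shows which pg is inconsistent and on which osds
ceph pg repair 3.ccf     # ask the primary to repair from the authoritative copy
ceph pg scrub 3.ccf      # optionally re-scrub afterwards to confirm it is clean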


Re: Striped images and cluster misbehavior

2012-12-30 Thread Samuel Just
Sorry for the delay.  A quick look at the log doesn't show anything
obvious... Can you elaborate on how you caused the hang?
-Sam

On Wed, Dec 19, 2012 at 3:53 AM, Andrey Korolyov and...@xdel.ru wrote:
 Please take a look at the log below; this is a slightly different bug -
 both osd processes on the node were stuck eating all available cpu
 until I killed them. This can be reproduced by doing parallel exports
 of different images from the same client IP using either ``rbd export''
 or API calls - after a couple of wrong ``downs'', osd.19 and osd.27
 finally got stuck. What is more interesting, 10.5.0.33 holds the most
 hungry set of virtual machines, constantly eating four of twenty-four
 HT cores, and this node fails almost always. The underlying fs is XFS,
 ceph version gf9d090e. In all likelihood my previous reports are about
 side effects of this problem.

 http://xdel.ru/downloads/ceph-log/osd-19_and_27_stuck.log.gz

 and timings for the monmap; the logs are from different hosts, so they
 may have a time shift of tens of milliseconds:

 http://xdel.ru/downloads/ceph-log/timings-crash-osd_19_and_27.txt

 Thanks!


Re: automatic repair of inconsistent pg?

2012-12-30 Thread Stefan Priebe

On 30.12.2012 19:17, Samuel Just wrote:

This is somewhat more likely to have been a bug in the replication logic
(there were a few fixed between 0.53 and 0.55).  Had there been any
recent osd failures?


Yes, I was stressing Ceph with failures (power, link, disk, ...).

Stefan






Re: Striped images and cluster misbehavior

2012-12-30 Thread Andrey Korolyov
On Sun, Dec 30, 2012 at 10:56 PM, Samuel Just sam.j...@inktank.com wrote:
 Sorry for the delay.  A quick look at the log doesn't show anything
 obvious... Can you elaborate on how you caused the hang?
 -Sam


I am sorry for all this noise; the issue was almost certainly triggered
by some bug in the Infiniband switch firmware, because a per-port reset
was able to solve the ``wrong mark'' problem - at least, it hasn't shown
up for a week now. The problem took almost two days to resolve - all
possible connectivity tests displayed no timeouts or drops that could
cause wrong marks. Finally, I started playing with TCP settings and
found that enabling ipv4.tcp_low_latency raised the probability of a
``wrong mark'' event several times - so the set of possible causes
quickly collapsed to a media-only problem, and I fixed it soon after.
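
(For reference, the knob in question is the net.ipv4.tcp_low_latency
sysctl, toggled with something like: sysctl -w net.ipv4.tcp_low_latency=0)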



ceph for small cluster?

2012-12-30 Thread Miles Fidelman

Hi Folks,

I'm wondering how ceph would work in a small cluster that supports a mix 
of engineering and modest production (email, lists, web server for 
several small communities).


Specifically, we have a rack with 4 medium-horsepower servers, each with 
4 disk drives, running Xen (debian dom0 and domUs) - all linked together 
w/ 4 gigE ethernets.


Currently, 2 of the servers are running a high-availability 
configuration, using DRBD to mirror specific volumes, and pacemaker for 
failover.


For a while, I've been looking for a way to replace DRBD with something 
that would mirror across more than 2 servers - so that we could migrate 
VMs arbitrarily - and that will work without splitting up compute vs. 
storage nodes (for the short term, at least, we're stuck with rack space 
and server limitations).


The thing that looks closest to filling the bill is Sheepdog (at least 
architecturally) - but it only provides a KVM interface. GlusterFS, 
XtreemFS, and Ceph keep coming up as possibilities - with Ceph's rbd 
interface looking like the easiest to integrate.


Which leads me to two questions:

- On a theoretical level, does using ceph as a storage pool for this 
kind of small cluster make any sense? (Notably, I'd see running an OSD, 
an MDS, a MON, and client DomUs on each of the 4 nodes, using LVM to 
pool all the storage; it also seems like folks recommend XFS as a 
production filesystem.) A rough config sketch of what I have in mind 
follows these two questions.


- On a practical level, has anybody tried building this kind of small 
cluster, and if so, what kind of results have you had?
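
To make the first question concrete, here is the rough shape of the 
ceph.conf I have in mind - hostnames, IPs and paths are made up, three 
mons to keep an odd quorum, one osd per node, and a single mds:

[global]
    auth supported = cephx

[mon.a]
    host = node1
    mon addr = 10.0.0.1:6789
[mon.b]
    host = node2
    mon addr = 10.0.0.2:6789
[mon.c]
    host = node3
    mon addr = 10.0.0.3:6789

[mds.a]
    host = node1

[osd]
    osd journal size = 1024    ; MB, journal colocated with the data in this sketch

[osd.0]
    host = node1
    osd data = /var/lib/ceph/osd/ceph-0
[osd.1]
    host = node2
    osd data = /var/lib/ceph/osd/ceph-1
[osd.2]
    host = node3
    osd data = /var/lib/ceph/osd/ceph-2
[osd.3]
    host = node4
    osd data = /var/lib/ceph/osd/ceph-3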


Comments and suggestions please!

Thank you very much,

Miles Fidelman

--
In theory, there is no difference between theory and practice.
In practice, there is.  -- Yogi Berra



Re: Improving responsiveness of KVM guests on Ceph storage

2012-12-30 Thread Andrey Korolyov
On Sun, Dec 30, 2012 at 9:05 PM, Jens Kristian Søgaard
j...@mermaidconsulting.dk wrote:

Hi Jens,

You may try playing with SCHED_RT. I have found it hard to use myself,
but you can achieve your goal by adding small RT slices via the ``cpu''
cgroup to the vcpu/emulator threads; it dramatically increases overall
VM responsiveness. I eventually abandoned it because the RT scheduler
is a very strange thing - it may cause endless lockups on disk
operations under heavy load, or produce an ever-stuck ``kworker'' on
some cores if you kill a VM which has separate RT slices for its vcpu
threads. Of course, some Ceph tuning like the writeback cache and a
larger journal may help you too; I am speaking primarily of the VM's
performance by itself.
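
Roughly what I mean, as a sketch - this assumes cgroup v1 mounted under
/sys/fs/cgroup/cpu, a kernel built with RT group scheduling, and a process
literally named qemu-kvm; the numbers are only examples:

# create a cpu cgroup for the guest and give it a small RT budget
mkdir /sys/fs/cgroup/cpu/vm1
echo 1000000 > /sys/fs/cgroup/cpu/vm1/cpu.rt_period_us    # 1 s period
echo 100000  > /sys/fs/cgroup/cpu/vm1/cpu.rt_runtime_us   # 100 ms of RT time per period

# move every thread of the guest (vcpus + emulator) into the group
# and give each one a low-priority RT policy
for tid in /proc/$(pidof qemu-kvm)/task/*; do
    t=${tid##*/}
    echo $t > /sys/fs/cgroup/cpu/vm1/tasks
    chrt -r -p 1 $t
done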



Re: Improving responsiveness of KVM guests on Ceph storage

2012-12-30 Thread Jens Kristian Søgaard

Hi Andrey,

Thanks for your reply!


You may try playing with SCHED_RT. I have found it hard to use myself,
but you can achieve your goal by adding small RT slices via the ``cpu''
cgroup to the vcpu/emulator threads; it dramatically increases overall
VM responsiveness.


I'm not quite sure I understand your suggestion.

Do you mean that you set the process priority to real-time on each 
qemu-kvm process, and then use cgroups cpu.rt_runtime_us / 
cpu.rt_period_us to restrict the amount of CPU time those processes can 
receive?


I'm not sure how that would apply here, as I have only one qemu-kvm 
process, and it is not unresponsive because of a lack of allocated 
CPU time slices - but rather because some I/Os take a long time to 
complete, and other I/Os apparently have to wait for those to complete.



Of course, some Ceph tuning like the writeback cache and a larger
journal may help you too; I am speaking primarily of the VM's
performance by itself.


I have been considering the journal as something where I could improve 
performance by tweaking the setup. I have set aside 10 GB of space for 
the journal, but I'm not sure if this is too little - or if the size 
really doesn't matter that much when it is on the same mdraid as the 
data itself.


Is there a tool that can tell me how much of my journal space is 
actually being used?


I.e., I'm looking for something that could tell me if increasing the 
size of the journal or placing it on a separate (SSD) disk could solve 
my problem.
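
The closest thing I have found so far - and I am not sure the counter 
names are right for this version - is the OSD admin socket, which exposes 
perf counters including journal queue figures:

ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump
# (on older builds the command may be called perfcounters_dump)
# then look at the filestore section, e.g. journal_queue_ops and
# journal_queue_bytes, for how much data is waiting to hit the journal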


How do I change the size of the writeback cache when using qemu-kvm like 
I do?


Does setting rbd cache size in ceph.conf have any effect on qemu-kvm, 
where the drive is defined as:


  format=rbd,file=rbd:data/image1:rbd_cache=1,if=virtio
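
What I have gathered so far from the docs - please correct me if this is 
wrong - is that key=value options appended after colons in the rbd: spec 
are handed straight to librbd, so the cache could be tuned either there or 
in a [client] section of ceph.conf on the KVM host. The numbers below are 
only examples:

  -drive format=rbd,file=rbd:data/image1:rbd_cache=true:rbd_cache_size=67108864:rbd_cache_max_dirty=33554432,if=virtio

  [client]
      rbd cache = true
      rbd cache size = 67108864          ; 64 MB
      rbd cache max dirty = 33554432     ; 32 MB
      rbd cache target dirty = 16777216  ; 16 MB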

--
Jens Kristian Søgaard, Mermaid Consulting ApS,
j...@mermaidconsulting.dk,
http://www.mermaidconsulting.com/