http://xdel.ru/downloads/ceph-logs-dbg/

On Fri, Mar 23, 2012 at 9:53 PM, Samuel Just <sam.j...@dreamhost.com> wrote:
> (CCing the list)
>
> Actually, could you re-do the rados bench run with 'debug journal
> = 20' along with the other debugging?  That should give us better
> information.
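> For reference, a minimal sketch of the debug settings I have in mind
> (these are standard ceph.conf options, nothing specific to your cluster)
> would be an [osd] section along the lines of:
>
>     [osd]
>         debug osd = 20
>         debug filestore = 20
>         debug journal = 20
>         debug ms = 1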
>
> -Sam
>
> On Fri, Mar 23, 2012 at 5:25 AM, Andrey Korolyov <and...@xdel.ru> wrote:
>> Hi Sam,
>>
>> Can you please suggest where to start profiling the osd? If the
>> bottleneck were related to something as simple as direct-io speed,
>> I`m sure I would have caught it long ago, if only by cross-checking
>> the results of other kinds of benchmarks on the host system. I`ve just
>> tried tmpfs under both journals; it gives a small boost, as expected,
>> because of the near-zero i/o delay. Maybe the chunk distribution
>> mechanism does not work well on such a small number of nodes, but right
>> now I don`t have enough hardware nodes to prove or disprove that.
>>
>> On Thu, Mar 22, 2012 at 10:40 PM, Andrey Korolyov <and...@xdel.ru> wrote:
>>> random-rw: (g=0): rw=write, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>>> Starting 1 process
>>> Jobs: 1 (f=1): [W] [100.0% done] [0K/35737K /s] [0/8725 iops] [eta 00m:00s]
>>> random-rw: (groupid=0, jobs=1): err= 0: pid=9647
>>>  write: io=163840KB, bw=37760KB/s, iops=9439, runt=  4339msec
>>>    clat (usec): min=70, max=39801, avg=104.19, stdev=324.29
>>>    bw (KB/s) : min=30480, max=43312, per=98.83%, avg=37317.00, stdev=5770.28
>>>  cpu          : usr=1.84%, sys=13.00%, ctx=40961, majf=0, minf=26
>>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>     issued r/w: total=0/40960, short=0/0
>>>     lat (usec): 100=79.69%, 250=19.89%, 500=0.12%, 750=0.12%, 1000=0.11%
>>>     lat (msec): 2=0.01%, 4=0.01%, 10=0.03%, 20=0.01%, 50=0.01%
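>>> The fio job file itself isn't shown above; as a rough sketch, an
>>> equivalent invocation for these 4k sync jobs (the target path is only
>>> a placeholder) would be something like:
>>>
>>>     fio --name=random-rw --filename=/path/to/journal-test \
>>>         --rw=write --bs=4k --size=160m \
>>>         --ioengine=sync --iodepth=2 --direct=1
>>>
>>> with --rw=randwrite instead of --rw=write for the random-write run
>>> quoted further down.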
>>>
>>>
>>> On Thu, Mar 22, 2012 at 9:26 PM, Samuel Just <sam.j...@dreamhost.com> wrote:
>>>> Our journal writes are actually sequential.  Could you send FIO
>>>> results for sequential 4k writes to osd.0's journal and osd.1's journal?
>>>> -Sam
>>>>
>>>> On Thu, Mar 22, 2012 at 5:21 AM, Andrey Korolyov <and...@xdel.ru> wrote:
>>>>> FIO output for the journal partition, directio enabled, seems good
>>>>> (same results for ext4 on other single SATA disks).
>>>>>
>>>>> random-rw: (g=0): rw=randwrite, bs=4K-4K/4K-4K, ioengine=sync, iodepth=2
>>>>> Starting 1 process
>>>>> Jobs: 1 (f=1): [w] [100.0% done] [0K/3219K /s] [0/786 iops] [eta 00m:00s]
>>>>> random-rw: (groupid=0, jobs=1): err= 0: pid=21926
>>>>>  write: io=163840KB, bw=2327KB/s, iops=581, runt= 70403msec
>>>>>    clat (usec): min=122, max=441551, avg=1714.52, stdev=7565.04
>>>>>    bw (KB/s) : min=  552, max= 3880, per=100.61%, avg=2341.23, 
>>>>> stdev=480.05
>>>>>  cpu          : usr=0.42%, sys=1.34%, ctx=40976, majf=0, minf=42
>>>>>  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>>>>>     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>>>>>     issued r/w: total=0/40960, short=0/0
>>>>>     lat (usec): 250=31.70%, 500=0.68%, 750=0.10%, 1000=0.63%
>>>>>     lat (msec): 2=41.31%, 4=20.91%, 10=4.40%, 20=0.17%, 50=0.07%
>>>>>     lat (msec): 500=0.04%
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 22, 2012 at 1:20 AM, Samuel Just <sam.j...@dreamhost.com> 
>>>>> wrote:
>>>>>> (CCing the list)
>>>>>>
>>>>>> So, the problem isn't the bandwidth.  Before we respond to the client,
>>>>>> we write the operation to the journal.  In this case, that operation
>>>>>> is taking >1s per operation on osd.1.  Both rbd and rados bench will
>>>>>> only allow a limited number of ops in flight at a time, so this
>>>>>> latency is killing your throughput.  For comparison, the latency for
>>>>>> writing to the journal on osd.0 is < .3s.  Can you measure direct io
>>>>>> latency for writes to your osd.1 journal file?
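>>>>>> As a sketch of such a measurement (write to a scratch file on the same
>>>>>> filesystem as the journal rather than to the live journal itself; the
>>>>>> path below just follows the mount points you posted):
>>>>>>
>>>>>>     dd if=/dev/zero of=/ceph/meta/latency-test bs=4k count=10000 oflag=direct,dsync
>>>>>>     rm /ceph/meta/latency-test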
>>>>>> -Sam
>>>>>>
>>>>>> On Wed, Mar 21, 2012 at 1:56 PM, Andrey Korolyov <and...@xdel.ru> wrote:
>>>>>>> Oh, you may be confused by the Zabbix metrics - the y-axis means
>>>>>>> megabytes/s, not megabits.
>>>>>>>
>>>>>>> On Thu, Mar 22, 2012 at 12:53 AM, Andrey Korolyov <and...@xdel.ru> 
>>>>>>> wrote:
>>>>>>>> [global]
>>>>>>>>       log dir = /ceph/out
>>>>>>>>       log_file = ""
>>>>>>>>       logger dir = /ceph/log
>>>>>>>>       pid file = /ceph/out/$type$id.pid
>>>>>>>> [mds]
>>>>>>>>       pid file = /ceph/out/$name.pid
>>>>>>>>       lockdep = 1
>>>>>>>>       mds log max segments = 2
>>>>>>>> [osd]
>>>>>>>>       lockdep = 1
>>>>>>>>       filestore_xattr_use_omap = 1
>>>>>>>>       osd data = /ceph/dev/osd$id
>>>>>>>>       osd journal = /ceph/meta/journal
>>>>>>>>       osd journal size = 100
>>>>>>>> [mon]
>>>>>>>>       lockdep = 1
>>>>>>>>       mon data = /ceph/dev/mon$id
>>>>>>>> [mon.0]
>>>>>>>>       host = 172.20.1.32
>>>>>>>>       mon addr = 172.20.1.32:6789
>>>>>>>> [mon.1]
>>>>>>>>       host = 172.20.1.33
>>>>>>>>       mon addr = 172.20.1.33:6789
>>>>>>>> [mon.2]
>>>>>>>>       host = 172.20.1.35
>>>>>>>>       mon addr = 172.20.1.35:6789
>>>>>>>> [osd.0]
>>>>>>>>       host = 172.20.1.32
>>>>>>>> [osd.1]
>>>>>>>>       host = 172.20.1.33
>>>>>>>> [mds.a]
>>>>>>>>       host = 172.20.1.32
>>>>>>>>
>>>>>>>> /dev/sda1 on /ceph type ext4 (rw,barrier=0,user_xattr)
>>>>>>>> /dev/mapper/system-cephmeta on /ceph/meta type ext4 
>>>>>>>> (rw,barrier=0,user_xattr)
>>>>>>>> Simple performance tests on those filesystems show ~133MB/s for /ceph
>>>>>>>> and /ceph/meta. Also, neither machine holds anything else which might
>>>>>>>> impact the osd.
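>>>>>>>> (A representative streaming-write test of that sort - not necessarily
>>>>>>>> the exact command used here - would be something like
>>>>>>>>
>>>>>>>>     dd if=/dev/zero of=/ceph/ddtest bs=1M count=2000 conv=fdatasync
>>>>>>>>
>>>>>>>> with an arbitrary scratch file name.)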
>>>>>>>>
>>>>>>>> Also please take note of the following:
>>>>>>>>
>>>>>>>> http://i.imgur.com/ZgFdO.png
>>>>>>>>
>>>>>>>> The first two peaks are from running rados bench, then come cluster
>>>>>>>> recreation and an automated Debian install, and the final peaks are the
>>>>>>>> dd test. Surely I can produce more precise graphs, but the current one is
>>>>>>>> probably enough to show the situation - rbd is utilizing about a quarter
>>>>>>>> of the possible bandwidth (if we count rados bench as 100%).
>>>>>>>>
>>>>>>>> On Thu, Mar 22, 2012 at 12:39 AM, Samuel Just <sam.j...@dreamhost.com> 
>>>>>>>> wrote:
>>>>>>>>> Hmm, there seem to be writes taking as long as 1.5s to hit journal on
>>>>>>>>> osd.1...  Could you post your ceph.conf?  Might there be a problem
>>>>>>>>> with the osd.1 journal disk?
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Wed, Mar 21, 2012 at 1:25 PM, Andrey Korolyov <and...@xdel.ru> 
>>>>>>>>> wrote:
>>>>>>>>>> Oh, sorry - they probably inherited permissions from the log files; fixed.
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 22, 2012 at 12:17 AM, Samuel Just 
>>>>>>>>>> <sam.j...@dreamhost.com> wrote:
>>>>>>>>>>> I get 403 Forbidden when I try to download any of the files.
>>>>>>>>>>> -Sam
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 21, 2012 at 11:51 AM, Andrey Korolyov <and...@xdel.ru> 
>>>>>>>>>>> wrote:
>>>>>>>>>>>> http://xdel.ru/downloads/ceph-logs/
>>>>>>>>>>>>
>>>>>>>>>>>> 1/ contains the logs for the bench initiated on the osd0 machine,
>>>>>>>>>>>> and 2/ the one initiated on osd1.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 21, 2012 at 8:54 PM, Samuel Just 
>>>>>>>>>>>> <sam.j...@dreamhost.com> wrote:
>>>>>>>>>>>>> Hmm, I'm seeing some very high latency on ops sent to osd.1.  Can 
>>>>>>>>>>>>> you
>>>>>>>>>>>>> post osd.1's logs?
>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 21, 2012 at 3:51 AM, Andrey Korolyov <and...@xdel.ru> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>> Here, please: http://xdel.ru/downloads/ceph.log.gz
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sometimes 'cur MB/s' shows zero during rados bench, even with all
>>>>>>>>>>>>>> debug output disabled and log_file set to an empty value; I hope
>>>>>>>>>>>>>> that`s okay.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 21, 2012 at 2:36 AM, Samuel Just 
>>>>>>>>>>>>>> <sam.j...@dreamhost.com> wrote:
>>>>>>>>>>>>>>> Can you set osd and filestore debugging to 20, restart the 
>>>>>>>>>>>>>>> osds, run
>>>>>>>>>>>>>>> rados bench as before, and post the logs?
>>>>>>>>>>>>>>> -Sam Just
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Mar 20, 2012 at 1:37 PM, Andrey Korolyov 
>>>>>>>>>>>>>>> <and...@xdel.ru> wrote:
>>>>>>>>>>>>>>>> rados bench 60 write -p data
>>>>>>>>>>>>>>>> <skip>
>>>>>>>>>>>>>>>> Total time run:        61.217676
>>>>>>>>>>>>>>>> Total writes made:     989
>>>>>>>>>>>>>>>> Write size:            4194304
>>>>>>>>>>>>>>>> Bandwidth (MB/sec):    64.622
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Average Latency:       0.989608
>>>>>>>>>>>>>>>> Max latency:           2.21701
>>>>>>>>>>>>>>>> Min latency:           0.255315
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here is a snippet from the osd log; the write size seems okay.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> 2012-03-21 00:00:39.397066 7fdda86a7700 osd.0 10 pg[0.58( v 
>>>>>>>>>>>>>>>> 10'83
>>>>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 
>>>>>>>>>>>>>>>> 10'82
>>>>>>>>>>>>>>>> active+clean]  removing repgather(0x31b5360 applying 10'83 
>>>>>>>>>>>>>>>> rep_tid=597
>>>>>>>>>>>>>>>> wfack= wfdisk= op=osd_op(client.4599.0:2533 
>>>>>>>>>>>>>>>> rb.0.2.000000000040 [write
>>>>>>>>>>>>>>>> 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>>>>> 2012-03-21 00:00:39.397086 7fdda86a7700 osd.0 10 pg[0.58( v 
>>>>>>>>>>>>>>>> 10'83
>>>>>>>>>>>>>>>> (0'0,10'83] n=50 ec=1 les/c 9/9 8/8/6) [0,1] r=0 lpr=8 mlcod 
>>>>>>>>>>>>>>>> 10'82
>>>>>>>>>>>>>>>> active+clean]    q front is repgather(0x31b5360 applying 10'83
>>>>>>>>>>>>>>>> rep_tid=597 wfack= wfdisk= op=osd_op(client.4599.0:2533
>>>>>>>>>>>>>>>> rb.0.2.000000000040 [write 1220608~4096] 0.17eb9fd8) v4)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Sorry for my previous question about rbd chunks, it was really 
>>>>>>>>>>>>>>>> stupid :)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 10:40 PM, Josh Durgin 
>>>>>>>>>>>>>>>> <josh.dur...@dreamhost.com> wrote:
>>>>>>>>>>>>>>>>> On 03/19/2012 11:13 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Nope, I`m using KVM for the rbd guests. I did notice that Sage
>>>>>>>>>>>>>>>>>> mentioned the value was too small, and I changed it to 64M before
>>>>>>>>>>>>>>>>>> posting the previous message, with no success - both 8M and that
>>>>>>>>>>>>>>>>>> value cause a performance drop. When I tried to write a small
>>>>>>>>>>>>>>>>>> amount of data, comparable to the writeback cache size (both to
>>>>>>>>>>>>>>>>>> the raw device and to ext3 mounted with the sync option), I got
>>>>>>>>>>>>>>>>>> the following results:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I just want to clarify that the writeback window isn't a full 
>>>>>>>>>>>>>>>>> writeback
>>>>>>>>>>>>>>>>> cache - it doesn't affect reads, and does not help with 
>>>>>>>>>>>>>>>>> request merging etc.
>>>>>>>>>>>>>>>>> It simply allows a bunch of writes to be in flight while 
>>>>>>>>>>>>>>>>> acking the write to
>>>>>>>>>>>>>>>>> the guest immediately. We're working on a full-fledged 
>>>>>>>>>>>>>>>>> writeback cache to replace the writeback window.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=10 oflag=direct 
>>>>>>>>>>>>>>>>>> (almost
>>>>>>>>>>>>>>>>>> same without oflag there and in the following samples)
>>>>>>>>>>>>>>>>>> 10+0 records in
>>>>>>>>>>>>>>>>>> 10+0 records out
>>>>>>>>>>>>>>>>>> 104857600 bytes (105 MB) copied, 0.864404 s, 121 MB/s
>>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=20 oflag=direct
>>>>>>>>>>>>>>>>>> 20+0 records in
>>>>>>>>>>>>>>>>>> 20+0 records out
>>>>>>>>>>>>>>>>>> 209715200 bytes (210 MB) copied, 6.67271 s, 31.4 MB/s
>>>>>>>>>>>>>>>>>> dd if=/dev/zero of=/var/img.1 bs=10M count=30 oflag=direct
>>>>>>>>>>>>>>>>>> 30+0 records in
>>>>>>>>>>>>>>>>>> 30+0 records out
>>>>>>>>>>>>>>>>>> 314572800 bytes (315 MB) copied, 12.4806 s, 25.2 MB/s
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> and so on. The reference test with bs=1M and count=2000 gives
>>>>>>>>>>>>>>>>>> slightly worse results _with_ the writeback cache than without,
>>>>>>>>>>>>>>>>>> as I`ve mentioned before.
>>>>>>>>>>>>>>>>>> Here are the bench results; they`re almost equal on both nodes:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> bench: wrote 1024 MB in blocks of 4096 KB in 9.037468 sec at 
>>>>>>>>>>>>>>>>>> 113 MB/sec
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> One thing to check is the size of the writes that are 
>>>>>>>>>>>>>>>>> actually being sent by
>>>>>>>>>>>>>>>>> rbd. The guest is probably splitting them into relatively 
>>>>>>>>>>>>>>>>> small (128 or
>>>>>>>>>>>>>>>>> 256k) writes. Ideally it would be sending 4k writes, and this 
>>>>>>>>>>>>>>>>> should be a
>>>>>>>>>>>>>>>>> lot faster.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> You can see the writes being sent by adding debug_ms=1 to the 
>>>>>>>>>>>>>>>>> client or osd.
>>>>>>>>>>>>>>>>> The format is osd_op(.*[write OFFSET~LENGTH]).
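>>>>>>>>>>>>>>>>> For example (the log file name and location depend on your
>>>>>>>>>>>>>>>>> setup, so treat the path as a placeholder):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>     grep -oE '\[write [0-9]+~[0-9]+\]' /ceph/out/osd.0.log | head
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> which prints the offset~length pairs so you can eyeball the
>>>>>>>>>>>>>>>>> actual write sizes.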
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Also, since I haven`t mentioned it before: network performance
>>>>>>>>>>>>>>>>>> is enough to sustain full gigabit connectivity with MTU 1500. It
>>>>>>>>>>>>>>>>>> does not seem to be an interrupt problem or anything like it -
>>>>>>>>>>>>>>>>>> even with ceph-osd, the ethernet card queues and the kvm instance
>>>>>>>>>>>>>>>>>> pinned to different sets of cores, nothing changes.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Mar 19, 2012 at 8:59 PM, Greg Farnum
>>>>>>>>>>>>>>>>>> <gregory.far...@dreamhost.com>  wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> It sounds like maybe you're using Xen? The "rbd writeback 
>>>>>>>>>>>>>>>>>>> window" option
>>>>>>>>>>>>>>>>>>> only works for userspace rbd implementations (eg, KVM).
>>>>>>>>>>>>>>>>>>> If you are using KVM, you probably want 81920000 (~80MB) 
>>>>>>>>>>>>>>>>>>> rather than
>>>>>>>>>>>>>>>>>>> 8192000 (~8MB).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> What options are you running dd with? If you run a rados 
>>>>>>>>>>>>>>>>>>> bench from both
>>>>>>>>>>>>>>>>>>> machines, what do the results look like?
>>>>>>>>>>>>>>>>>>> Also, can you do the ceph osd bench on each of your OSDs, 
>>>>>>>>>>>>>>>>>>> please?
>>>>>>>>>>>>>>>>>>> (http://ceph.newdream.net/wiki/Troubleshooting#OSD_performance)
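>>>>>>>>>>>>>>>>>>> (That is, something along the lines of
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>     ceph tell osd.0 bench
>>>>>>>>>>>>>>>>>>>     ceph tell osd.1 bench
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> with the exact syntax depending on your ceph version; the result
>>>>>>>>>>>>>>>>>>> should show up in the cluster log, e.g. via 'ceph -w'.)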
>>>>>>>>>>>>>>>>>>> -Greg
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Monday, March 19, 2012 at 6:46 AM, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Stranger still, write speed drops by fifteen percent when this
>>>>>>>>>>>>>>>>>>>> option is set in the vm`s config (unlike the result reported in
>>>>>>>>>>>>>>>>>>>> http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg03685.html).
>>>>>>>>>>>>>>>>>>>> As I mentioned, I`m using 0.43, but due to crashed osds, ceph has
>>>>>>>>>>>>>>>>>>>> been recompiled at e43546dee9246773ffd6877b4f9495f1ec61cd55 and
>>>>>>>>>>>>>>>>>>>> 1468d95101adfad44247016a1399aab6b86708d2 - both builds crashed
>>>>>>>>>>>>>>>>>>>> under heavy load.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Sun, Mar 18, 2012 at 10:22 PM, Sage Weil <s...@newdream.net> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Sat, 17 Mar 2012, Andrey Korolyov wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I`ve done some performance tests on the following configuration:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> mon0, osd0 and mon1, osd1 - two twelve-core r410s with 32G of
>>>>>>>>>>>>>>>>>>>>>> ram; mon2 - a dom0 with three dedicated cores and 1.5G, mostly
>>>>>>>>>>>>>>>>>>>>>> idle. The first three disks on each r410 are arranged into raid0
>>>>>>>>>>>>>>>>>>>>>> and hold the osd data, while the fourth holds the os and the
>>>>>>>>>>>>>>>>>>>>>> osd`s journal partition; all ceph-related filesystems are ext4
>>>>>>>>>>>>>>>>>>>>>> mounted without barriers.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> First, I`ve noticed a gap between benchmark performance and write
>>>>>>>>>>>>>>>>>>>>>> speed through rbd from a small kvm instance running on one of the
>>>>>>>>>>>>>>>>>>>>>> first two machines - where the bench gave me about 110MB/s,
>>>>>>>>>>>>>>>>>>>>>> writing zeros to the raw block device inside the vm with dd
>>>>>>>>>>>>>>>>>>>>>> topped out at about 45 MB/s, and for the vm`s fs (ext4 with
>>>>>>>>>>>>>>>>>>>>>> default options) performance drops to ~23MB/s. Things get worse
>>>>>>>>>>>>>>>>>>>>>> when I start a second vm on the second host and run the same dd
>>>>>>>>>>>>>>>>>>>>>> tests simultaneously - performance is fairly divided in half
>>>>>>>>>>>>>>>>>>>>>> between the two instances :). Enabling jumbo frames, playing with
>>>>>>>>>>>>>>>>>>>>>> cpu affinity for the ceph and vm instances, and trying different
>>>>>>>>>>>>>>>>>>>>>> TCP congestion protocols had no effect at all - with DCTCP I get
>>>>>>>>>>>>>>>>>>>>>> a slightly smoother network load graph, and that`s all.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Can the ml please suggest anything to try to improve performance?
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Can you try setting
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> rbd writeback window = 8192000
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> or similar, and see what kind of effect that has? I 
>>>>>>>>>>>>>>>>>>>>> suspect it'll speed
>>>>>>>>>>>>>>>>>>>>> up dd; I'm less sure about ext3.
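>>>>>>>>>>>>>>>>>>>>> (As a sketch of where that setting would live for a KVM guest:
>>>>>>>>>>>>>>>>>>>>> in ceph.conf on the host running qemu, e.g.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>     [client]
>>>>>>>>>>>>>>>>>>>>>         rbd writeback window = 8192000
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> or, depending on your qemu/librbd setup, passed as an option in
>>>>>>>>>>>>>>>>>>>>> the qemu rbd disk string.)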
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>>>>>>>> sage
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> ceph-0.43, libvirt-0.9.8, qemu-1.0.0, kernel 3.2
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
