This is a blktrace of the problem, though this is the first time I’ve used the tool. It begins when “swapper” shows up (probably because it is a dirty page and thus gets flushed?).
“+ 8” should mean 8 sectors = 4KiB?

8,160  1  55925  2.692651182  1436712  Q  FWS [fio]
8,160  1  55926  2.692652285  1436712  G  FWS [fio]
8,160  1  55927  2.692652923  1436712  I  FWS [fio]
8,160  1  55928  2.692842413     6770  C  WS 16760024 + 8 [0]
8,160  1  55929  2.692881511     6770  C  WS 0 [0]
8,160  1  55930  2.693029408        0  C  WS 4023248 + 8 [0]
8,160  1  55931  2.693066725     6770  C  WS 0 [0]
8,160  1  55932  2.693266146        0  C  WS 16328824 + 8 [0]
8,160  1  55933  2.693297033  1436712  Q  FWS [fio]
8,160  1  55934  2.693297628  1436712  G  FWS [fio]
8,160  1  55935  2.693298042  1436712  I  FWS [fio]
8,160  1  55936  2.693478425        0  C  WS 11772464 + 8 [0]
8,160  1  55937  2.693484440  1436712  Q  FWS [fio]
8,160  1  55938  2.693484919  1436712  G  FWS [fio]
8,160  1  55939  2.693485180  1436712  I  FWS [fio]
8,160  1  55940  2.693669076        0  C  WS 19637160 + 8 [0]
8,160  1  55941  2.693675679  1436712  Q  FWS [fio]
8,160  1  55942  2.693676180  1436712  G  FWS [fio]
8,160  1  55943  2.693676706  1436712  I  FWS [fio]
8,160  1  55944  2.693898452        0  C  WS 12364112 + 8 [0]
8,160  1  55945  2.693963594        0  C  WS 0 [0]
8,160  1  55946  2.693968646  1436712  U  N [fio] 0
8,160  1  55947  2.693990010  1436712  Q  WS 1864456 + 8 [fio]
8,160  1  55948  2.693990846  1436712  G  WS 1864456 + 8 [fio]
8,160  1  55949  2.693991543  1436712  I  WS 1864456 + 8 [fio]
8,160  1  55950  2.693992127  1436712  D  WS 1864456 + 8 [fio]
8,160  1  55951  2.693995244  1436712  U  N [fio] 1
8,160  1  55952  2.694233777  1436712  Q  FWFS 468960767 + 16 [fio]
8,160  1  55953  2.694234504  1436712  G  FWFS 468960767 + 16 [fio]
8,160  1  55954  2.694235144  1436712  I  FWFS 468960767 + 16 [fio]
8,160  1  55955  2.706561235     8989  C  WFS 468960767 + 16 [0]
8,160  1  55956  2.706562115     8989  C  WFS 468960767 [0]
8,160  1  55957  2.725928434  1436712  Q  FWFS 468960804 + 18 [fio]
8,160  1  55958  2.725929908  1436712  G  FWFS 468960804 + 18 [fio]
8,160  1  55959  2.725930980  1436712  I  FWFS 468960804 + 18 [fio]
8,160  7     12  2.706817448        0  C  WS 17954496 + 8 [0]
8,160  7     13  2.706892420        0  D  WFS 468960783 + 10 [swapper]
8,160  7     14  2.715675674        0  C  WS 3242200 + 8 [0]
8,160  7     15  2.715771745        0  D  WFS 468960793 + 11 [swapper]
8,160  7     16  2.736072549        0  C  WS 2058840 + 8 [0]
8,160  7     17  2.736167908        0  D  WFS 468960822 + 10 [swapper]
8,160  7     18  2.746284722        0  C  WS 20142728 + 8 [0]
8,160  7     19  2.746370687        0  D  WFS 468960832 + 11 [swapper]

Jan

> On 09 Jul 2015, at 15:50, Alexandre DERUMIER <[email protected]> wrote:
>
>>> Any ideas where to look? I was hoping blktrace would show what exactly is
>>> going on, but it just shows a synchronous write -> (10ms) -> completed
>
> which size is the write in this case ? 4K ? or more ?
>
>
> ----- Mail original -----
> De: "Jan Schermer" <[email protected]>
> À: "aderumier" <[email protected]>
> Cc: "ceph-users" <[email protected]>
> Envoyé: Jeudi 9 Juillet 2015 15:29:15
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
>
> I tried everything: --write-barrier, --sync, --fsync, --fdatasync.
> I never get the same 10ms latency. Must be something special that the
> filesystem journal/log does.
>
> Any ideas where to look? I was hoping blktrace would show what exactly is
> going on, but it just shows a synchronous write -> (10ms) -> completed
>
> Jan
>
>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER <[email protected]> wrote:
>>
>>>> I have 12K IOPS in this test on the block device itself. But only 100
>>>> filesystem transactions (=IOPS) on a filesystem on the same device,
>>>> because the “flush” (=FUA?) operation takes 10ms to finish. I just
>>>> can’t replicate the same “flush” operation with fio on the block
>>>> device, unfortunately, so I have no idea what is causing that :/
>>
>> AFAIK, fio on a block device with --sync=1 does a flush after each
>> write.
>>
>> I'm not sure about fio on a filesystem, but the filesystem should do an
>> fsync after each file write.
>>
>>
>> ----- Mail original -----
>> De: "Jan Schermer" <[email protected]>
>> À: "aderumier" <[email protected]>
>> Cc: "ceph-users" <[email protected]>
>> Envoyé: Jeudi 9 Juillet 2015 14:43:46
>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
>>
>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5
>> and higher have it for sure.
>>
>> I have 12K IOPS in this test on the block device itself. But only 100
>> filesystem transactions (=IOPS) on a filesystem on the same device,
>> because the “flush” (=FUA?) operation takes 10ms to finish. I just can’t
>> replicate the same “flush” operation with fio on the block device,
>> unfortunately, so I have no idea what is causing that :/
>>
>> Jan
>>
>>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER <[email protected]> wrote:
>>>
>>> Hi,
>>> I have already seen bad performance with Crucial m550 ssds: 400 iops
>>> synchronous write.
>>>
>>> Not sure what model of ssd you have?
>>>
>>> see this:
>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>
>>> what is your result on the disk directly with
>>>
>>> # dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync
>>> # fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k
>>>   --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
>>>   --name=journal-test
>>>
>>> ?
>>>
>>> I'm using lsi 3008 controllers with intel ssds (3500, 3610, 3700) in
>>> passthrough mode, and don't have any problem.
>>>
>>> also, about centos 2.6.32: I'm not sure FUA support has been backported
>>> by redhat (since true FUA support arrived in 2.6.37),
>>> so maybe it's the old barrier code.
>>>
>>> ----- Mail original -----
>>> De: "Jan Schermer" <[email protected]>
>>> À: "ceph-users" <[email protected]>
>>> Envoyé: Jeudi 9 Juillet 2015 12:32:04
>>> Objet: [ceph-users] Investigating my 100 IOPS limit
>>>
>>> I hope this will be interesting for some; it nearly cost me my sanity.
>>>
>>> Some time ago I came here with a problem manifesting as a “100 IOPS*”
>>> limit with the LSI controllers and some drives.
>>> It almost drove me crazy, as I could replicate the problem with ease, but
>>> when I wanted to show it to someone it was often gone. Sometimes it
>>> required fio to write for some time before the problem manifested again,
>>> or required seemingly conflicting settings to come up…
>>>
>>> Well, it turns out the problem is fio calling fallocate() when creating
>>> the file to use for the test, which doesn’t really allocate the blocks,
>>> it just “reserves” them.
>>> When fio writes to those blocks, the filesystem journal becomes the
>>> bottleneck (the 100 IOPS* limit can be seen there, at 100% utilization).
>>>
>>> If, however, I create the file with dd or the like, those writes do
>>> _not_ end up in the journal, and the result is 10K synchronous 4K IOPS
>>> on the same drive.
>>> If, for example, I run fio with a 1M block size, it still does 100* IOPS,
>>> and when I then run a 4K block size test without deleting the file, it
>>> runs at a 10K IOPS pace until it hits the first unwritten blocks - then
>>> it slows to a crawl again.
>>>
>>> The same issue is present with XFS and ext3/ext4 (with default mount
>>> options), and no matter how I create or mount the filesystem, I cannot
>>> avoid it. The only workaround is to mount ext4 with -o
>>> journal_async_commit, which should be safe, but...
>>>
>>> I am working on top of a CentOS 6.5 install (2.6.32 kernel) with LSI
>>> HBAs and Kingston SSDs in this case (interestingly, this issue does not
>>> seem to occur on Samsung SSDs!). I think it has something to do with LSI
>>> faking “FUA” support for the drives (AFAIK they don’t support it, so the
>>> controller must somehow flush the cache, which is what introduces the
>>> huge latency hit). I can’t replicate this problem on the block device
>>> itself, only on a file on a filesystem, so it might as well be a
>>> kernel/driver bug. I have a blktrace showing the difference between the
>>> “good” and “bad” writes, but I don’t know what the driver/controller
>>> does - I only see the write on the log device finishing after a long
>>> 10ms.
>>>
>>> Could someone tell me how Ceph creates the filesystem objects? I suppose
>>> it does fallocate() as well, right? Any way to force it to write them
>>> out completely instead, to get around this issue I have?
>>>
>>> How to replicate:
>>>
>>> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k
>>> --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting
>>> --name=journal-test --size=1000M --ioengine=libaio
>>>
>>> * It is in fact 98 IOPS. Exactly. Not more, not less :-)
>>>
>>> Jan
>>> _______________________________________________
>>> ceph-users mailing list
>>> [email protected]
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
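One way to check the fallocate() explanation from the thread on your own machine is to compare the extent flags of a fallocate()d file against a pre-written one. This is a sketch, not something from the original thread: the mount point is a placeholder, and filefrag needs a FIEMAP-capable filesystem such as ext4 or XFS to show the flags.

```shell
MNT=/mnt/test   # placeholder: an ext4/XFS mount on the drive under test

# File created the way fio's preallocation does it: blocks are reserved
# but the extents are flagged "unwritten". The first synchronous write
# into each such extent must convert it, which goes through the journal.
fallocate -l 100M "$MNT/falloc.img"
filefrag -v "$MNT/falloc.img"       # extents should carry the "unwritten" flag

# Pre-writing the file instead (as with dd) leaves no unwritten extents,
# so a later fio --sync=1 run only pays for the data writes themselves.
dd if=/dev/zero of="$MNT/prewritten.img" bs=1M count=100 conv=fsync
filefrag -v "$MNT/prewritten.img"   # no "unwritten" flag here

# Then re-run the reproducer from the thread against the pre-written file:
# fio --filename=$MNT/prewritten.img --sync=1 --rw=write --bs=4k \
#     --numjobs=1 --iodepth=1 --runtime=60 --time_based \
#     --group_reporting --name=journal-test
```

If your fio build supports it, passing fallocate=none in the job should also skip the preallocation path entirely, so fio writes out the file itself during layout instead of reserving unwritten extents.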
