Re: [ceph-users] Investigating my 100 IOPS limit

Jan Schermer Thu, 09 Jul 2015 08:25:20 -0700

Those are very strange numbers. Is the “60” figure right?

Can you paste the full fio command and output?
Thanks


Jan

> On 09 Jul 2015, at 15:58, Alexandre DERUMIER <[email protected]> wrote:
> 
> I just tried on an Intel S3700, on top of XFS.
> 
> fio, with:
> - sequential synchronous 4k write, iodepth=1: 60 IOPS
> - sequential synchronous 4k write, iodepth=32: 2000 IOPS
> - random synchronous 4k write, iodepth=1: 8000 IOPS
> - random synchronous 4k write, iodepth=32: 18000 IOPS
> 
> 
> 
> ----- Original message -----
> From: "aderumier" <[email protected]>
> To: "Jan Schermer" <[email protected]>
> Cc: "ceph-users" <[email protected]>
> Sent: Thursday, 9 July 2015 15:50:35
> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
> 
>>> Any ideas where to look? I was hoping blktrace would show what exactly is 
>>> going on, but it just shows a synchronous write -> (10ms) -> completed 
> 
> What size is the write in this case? 4K, or more?
> 
> 
> ----- Original message -----
> From: "Jan Schermer" <[email protected]>
> To: "aderumier" <[email protected]>
> Cc: "ceph-users" <[email protected]>
> Sent: Thursday, 9 July 2015 15:29:15
> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
> 
> I tried everything: --write-barrier, --sync, --fsync, --fdatasync
> I never get the same 10ms latency. Must be something the filesystem 
> journal/log does that is special. 
> 
> Any ideas where to look? I was hoping blktrace would show what exactly is 
> going on, but it just shows a synchronous write -> (10ms) -> completed 
> 
> Jan 
> 
>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER <[email protected]> wrote: 
>> 
>>>> I have 12K IOPS in this test on the block device itself. But only 100 
>>>> filesystem transactions (=IOPS) on filesystem on the same device because 
>>>> the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate 
>>>> the same “flush” operation with fio on the block device, unfortunately, 
>>>> so I have no idea what is causing that :/ 
>> 
>> AFAIK, fio on a block device with --sync=1 does a flush after each 
>> write. 
>> 
>> I'm not sure about fio on a filesystem, but the filesystem should do an 
>> fsync after each file write. 
>> 
>> 
>> ----- Original message -----
>> From: "Jan Schermer" <[email protected]>
>> To: "aderumier" <[email protected]>
>> Cc: "ceph-users" <[email protected]>
>> Sent: Thursday, 9 July 2015 14:43:46
>> Subject: Re: [ceph-users] Investigating my 100 IOPS limit
>> 
>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 
>> and higher have it for sure. 
>> 
>> I have 12K IOPS in this test on the block device itself. But only 100 
>> filesystem transactions (=IOPS) on filesystem on the same device because the 
>> “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the 
>> same “flush” operation with fio on the block device, unfortunately, so I 
>> have no idea what is causing that :/ 
>> 
>> Jan 
>> 
>>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER <[email protected]> wrote: 
>>> 
>>> Hi, 
>>> I have already seen bad performance with a Crucial M550 SSD: 400 IOPS 
>>> for synchronous writes. 
>>> 
>>> What model of SSD do you have? 
>>> 
>>> See this: 
>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>  
>>> 
>>> What results do you get on the disk directly with 
>>> 
>>> #dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync 
>>> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 
>>> --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test 
>>> 
>>> ? 
>>> 
>>> I'm using LSI 3008 controllers with Intel SSDs (3500, 3610, 3700) in 
>>> passthrough mode, and don't have any problems. 
>>> 
>>> 
>>> Also, about the CentOS 2.6.32 kernel: I'm not sure FUA support has been 
>>> backported by Red Hat (true FUA support has only existed since 2.6.37), 
>>> so maybe it's the old barrier code. 
>>> 
>>> 
>>> ----- Original message -----
>>> From: "Jan Schermer" <[email protected]>
>>> To: "ceph-users" <[email protected]>
>>> Sent: Thursday, 9 July 2015 12:32:04
>>> Subject: [ceph-users] Investigating my 100 IOPS limit
>>> 
>>> I hope this will be interesting for some; it nearly cost me my sanity. 
>>> 
>>> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit 
>>> with the LSI controllers and some drives. 
>>> It almost drove me crazy as I could replicate the problem with ease but 
>>> when I wanted to show it to someone it was often gone. Sometimes it 
>>> required fio to write for some time for the problem to manifest again, 
>>> required seemingly conflicting settings to come up… 
>>> 
>>> Well, turns out the problem is fio calling fallocate() when creating the 
>>> file to use for this test, which doesn’t really allocate the blocks, it 
>>> just “reserves” them. 
>>> When fio writes to those blocks, the filesystem journal becomes the 
>>> bottleneck (100 IOPS* limit can be seen there with 100% utilization). 
>>> 
>>> If, however, I create the file with dd or such, those writes do _not_ end 
>>> in the journal, and the result is 10K synchronous 4K IOPS on the same 
>>> drive. 
>>> If, for example, I run fio with a 1M block size, it would still do 100* 
>>> IOPS and when I then run a 4K block size test without deleting the file, it 
>>> would run at a 10K IOPS pace until it hits the first unwritten blocks - 
>>> then it slows to a crawl again. 
>>> 
>>> The same issue is present with XFS and ext3/ext4 (with default mount 
>>> options), and no matter how I create or mount the filesystem, I cannot 
>>> avoid it. The only way to avoid this problem is to mount ext4 with -o 
>>> journal_async_commit, which should be safe, but... 
>>> 
>>> I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and 
>>> Kingston SSDs in this case (interestingly, this issue does not seem to 
>>> occur on Samsung SSDs!). I think it has something to do with LSI faking a 
>>> “FUA” support for the drives (AFAIK they don’t support it so the controller 
>>> must somehow flush the cache, which is what introduces a huge latency hit). 
>>> I can’t replicate this problem on the block device itself, only on a file 
>>> on filesystem, so it might as well be a kernel/driver bug. I have a 
>>> blktrace showing the difference between the “good” and “bad” writes, but I 
>>> don’t know what the driver/controller does - I only see the write on the 
>>> log device finishing after a long 10ms. 
>>> 
>>> Could someone tell me how Ceph creates the filesystem objects? I suppose it 
>>> does fallocate() as well, right? Is there any way to force it to write 
>>> them out completely instead of using fallocate(), to get around this issue? 
>>> 
>>> How to replicate: 
>>> 
>>> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k 
>>> --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting 
>>> --name=journal-test --size=1000M --ioengine=libaio 
>>> 
>>> 
>>> * It is in fact 98 IOPS. Exactly. Not more, not less :-) 
>>> 
>>> Jan 
>>> _______________________________________________ 
>>> ceph-users mailing list 
>>> [email protected] 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
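
For anyone wanting to try the dd workaround from the original post (create
the test file by writing it out, rather than letting fio fallocate() it),
here is a minimal sketch. It assumes Linux with GNU dd. TESTFILE and SIZE_MB
are placeholders chosen for illustration, not values from the thread, and
this has not been run on the hardware discussed above.

```shell
#!/bin/sh
# Sketch only: TESTFILE and SIZE_MB are hypothetical placeholders.
set -eu
TESTFILE="${TESTFILE:-./testfile.fio}"
SIZE_MB="${SIZE_MB:-8}"

# Write real zeros through the whole file so every block is allocated and
# written, not merely reserved by fallocate(). conv=fsync makes dd flush
# the file data to the device before it exits.
dd if=/dev/zero of="$TESTFILE" bs=1M count="$SIZE_MB" conv=fsync 2>/dev/null

# Report the resulting size in bytes.
BYTES=$(wc -c < "$TESTFILE" | tr -d ' ')
echo "prefilled $TESTFILE: $BYTES bytes"

# Then point the fio job from the original post at the same file, e.g.:
#   fio --filename="$TESTFILE" --sync=1 --rw=write --bs=4k --numjobs=1 \
#       --iodepth=1 --group_reporting --name=journal-test \
#       --size="${SIZE_MB}M" --ioengine=libaio
```

Keeping the fio --size equal to the prefilled size matters here: per the
original post, a run over a partly written file goes fast only until it
hits the first unwritten blocks.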
> 

