>>I have 12K IOPS in this test on the block device itself. But only 100
>>filesystem transactions (=IOPS) on filesystem on the same device because the
>>“flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the
>>same “flush” operation with fio on the block device, unfortunately, so I
>>have no idea what is causing that :/

AFAIK, fio on a block device with --sync=1 does a flush after each write.

I'm not sure about fio on a filesystem, but the filesystem should do an fsync after each file write.

----- Original Message -----
From: "Jan Schermer" <[email protected]>
To: "aderumier" <[email protected]>
Cc: "ceph-users" <[email protected]>
Sent: Thursday, 9 July 2015 14:43:46
Subject: Re: [ceph-users] Investigating my 100 IOPS limit

The old FUA code has been backported for quite some time. RHEL/CentOS 6.5 and higher have it for sure.

I have 12K IOPS in this test on the block device itself. But only 100 filesystem transactions (=IOPS) on the filesystem on the same device because the “flush” (=FUA?) operation takes 10ms to finish. I just can’t replicate the same “flush” operation with fio on the block device, unfortunately, so I have no idea what is causing that :/

Jan

> On 09 Jul 2015, at 14:08, Alexandre DERUMIER <[email protected]> wrote:
>
> Hi,
> I have already seen bad performance with a Crucial m550 SSD: 400 IOPS for
> synchronous writes.
>
> Not sure what model of SSD you have?
>
> See this:
> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>
>
> What is your result on the disk directly with
>
> #dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync
> #fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1
> --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test
>
> ?
>
> I'm using LSI 3008 controllers with Intel SSDs (3500, 3610, 3700) in passthrough
> mode, and don't have any problems.
>
>
> Also, about CentOS 2.6.32: I'm not sure FUA support has been backported by
> Red Hat (true FUA support has only existed since 2.6.37),
> so maybe it's the old barrier code.
>
>
> ----- Original Message -----
> From: "Jan Schermer" <[email protected]>
> To: "ceph-users" <[email protected]>
> Sent: Thursday, 9 July 2015 12:32:04
> Subject: [ceph-users] Investigating my 100 IOPS limit
>
> I hope this will be interesting for some; it nearly cost me my sanity.
>
> Some time ago I came here with a problem manifesting as a “100 IOPS*” limit
> with the LSI controllers and some drives.
> It almost drove me crazy, as I could replicate the problem with ease, but when
> I wanted to show it to someone it was often gone. Sometimes it required fio
> to write for some time for the problem to manifest again, or required seemingly
> conflicting settings to come up…
>
> Well, it turns out the problem is fio calling fallocate() when creating the file
> to use for this test, which doesn’t really allocate the blocks, it just
> “reserves” them.
> When fio writes to those blocks, the filesystem journal becomes the
> bottleneck (the 100 IOPS* limit can be seen there with 100% utilization).
>
> If, however, I create the file with dd or such, those writes do _not_ end up in
> the journal, and the result is 10K synchronous 4K IOPS on the same drive.
> If, for example, I run fio with a 1M block size, it still does 100* IOPS,
> and when I then run a 4K block size test without deleting the file, it
> runs at a 10K IOPS pace until it hits the first unwritten blocks - then it
> slows to a crawl again.
>
> The same issue is present with XFS and ext3/ext4 (with default mount
> options), and no matter how I create the filesystem or mount it, I cannot avoid
> this problem. The only way to avoid it is to mount ext4 with -o
> journal_async_commit, which should be safe, but...
>
> I am working on top of a CentOS 6.5 install (2.6.32 kernel), LSI HBAs and
> Kingston SSDs in this case (interestingly, this issue does not seem to occur
> on Samsung SSDs!).
> I think it has something to do with LSI faking “FUA”
> support for the drives (AFAIK they don’t support it, so the controller must
> somehow flush the cache, which is what introduces a huge latency hit).
> I can’t replicate this problem on the block device itself, only on a file on a
> filesystem, so it might as well be a kernel/driver bug. I have a blktrace
> showing the difference between the “good” and “bad” writes, but I don’t know
> what the driver/controller does - I only see the write on the log device
> finishing after a long 10ms.
>
> Could someone tell me how Ceph creates the filesystem objects? I suppose it
> does fallocate() as well, right? Is there any way to force it to write them out
> completely and not use fallocate(), to get around this issue?
>
> How to replicate:
>
> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k
> --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting --name=journal-test
> --size=1000M --ioengine=libaio
>
>
> * It is in fact 98 IOPS. Exactly. Not more, not less :-)
>
> Jan
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
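
For anyone who wants to check the fallocate() effect described in this thread without fio, here is a minimal Python sketch. It is not from the thread: it assumes Linux (os.posix_fallocate and os.O_DSYNC), and the helper names mean_sync_write_seconds and compare are made up for illustration. It times 4K O_DSYNC writes into a file that was only fallocate()d and into one that was fully written first, which roughly mirrors the fio-created versus dd-created files. Whether a gap shows up at all will depend on the filesystem, kernel, controller and drive.

```python
import os
import tempfile
import time

BLOCK = 4096  # 4K writes, matching --bs=4k in the fio job


def mean_sync_write_seconds(path, size_bytes, prewrite):
    """Create `path`, then return the mean seconds per 4K O_DSYNC write.

    The file is always reserved with posix_fallocate(). With prewrite=True
    every block is additionally written once (like creating it with dd), so
    the timed pass overwrites already-written blocks.
    """
    fd = os.open(path, os.O_CREAT | os.O_TRUNC | os.O_WRONLY, 0o600)
    try:
        os.posix_fallocate(fd, 0, size_bytes)
        if prewrite:
            zeros = bytes(BLOCK)
            for _ in range(size_bytes // BLOCK):
                os.write(fd, zeros)
            os.fsync(fd)  # make sure the pre-written data is on disk
    finally:
        os.close(fd)

    # Timed pass: sequential synchronous 4K writes from offset 0.
    fd = os.open(path, os.O_WRONLY | os.O_DSYNC)
    try:
        payload = b"x" * BLOCK
        count = size_bytes // BLOCK
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, payload)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / count


def compare(directory, size_bytes=256 * BLOCK):
    """Return {"fallocated": s, "prewritten": s} mean write times in seconds."""
    results = {}
    for label, prewrite in (("fallocated", False), ("prewritten", True)):
        path = os.path.join(directory, "testfile." + label)
        try:
            results[label] = mean_sync_write_seconds(path, size_bytes, prewrite)
        finally:
            if os.path.exists(path):
                os.remove(path)
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        for label, secs in compare(d).items():
            print(f"{label}: {secs * 1000:.3f} ms/write, ~{1 / secs:.0f} IOPS")
```

To test a particular filesystem, call compare() with a directory on that mount (for example the /mnt/something mount from the fio command) rather than the default temp directory, which may be tmpfs and would hide any journal effect.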
