This is a blktrace of the problem, though this is the first time I’ve used the tool. It begins when “swapper” shows up (probably because it is a dirty page and thus gets flushed?).
“+ 8” should mean 8 sectors = 4KiB?

8,160  1  55925  2.692651182  1436712  Q  FWS [fio]
8,160  1  55926  2.692652285  1436712  G  FWS [fio]
8,160  1  55927  2.692652923  1436712  I  FWS [fio]
8,160  1  55928  2.692842413     6770  C  WS 16760024 + 8 [0]
8,160  1  55929  2.692881511     6770  C  WS 0 [0]
8,160  1  55930  2.693029408        0  C  WS 4023248 + 8 [0]
8,160  1  55931  2.693066725     6770  C  WS 0 [0]
8,160  1  55932  2.693266146        0  C  WS 16328824 + 8 [0]
8,160  1  55933  2.693297033  1436712  Q  FWS [fio]
8,160  1  55934  2.693297628  1436712  G  FWS [fio]
8,160  1  55935  2.693298042  1436712  I  FWS [fio]
8,160  1  55936  2.693478425        0  C  WS 11772464 + 8 [0]
8,160  1  55937  2.693484440  1436712  Q  FWS [fio]
8,160  1  55938  2.693484919  1436712  G  FWS [fio]
8,160  1  55939  2.693485180  1436712  I  FWS [fio]
8,160  1  55940  2.693669076        0  C  WS 19637160 + 8 [0]
8,160  1  55941  2.693675679  1436712  Q  FWS [fio]
8,160  1  55942  2.693676180  1436712  G  FWS [fio]
8,160  1  55943  2.693676706  1436712  I  FWS [fio]
8,160  1  55944  2.693898452        0  C  WS 12364112 + 8 [0]
8,160  1  55945  2.693963594        0  C  WS 0 [0]
8,160  1  55946  2.693968646  1436712  U  N [fio] 0
8,160  1  55947  2.693990010  1436712  Q  WS 1864456 + 8 [fio]
8,160  1  55948  2.693990846  1436712  G  WS 1864456 + 8 [fio]
8,160  1  55949  2.693991543  1436712  I  WS 1864456 + 8 [fio]
8,160  1  55950  2.693992127  1436712  D  WS 1864456 + 8 [fio]
8,160  1  55951  2.693995244  1436712  U  N [fio] 1
8,160  1  55952  2.694233777  1436712  Q  FWFS 468960767 + 16 [fio]
8,160  1  55953  2.694234504  1436712  G  FWFS 468960767 + 16 [fio]
8,160  1  55954  2.694235144  1436712  I  FWFS 468960767 + 16 [fio]
8,160  1  55955  2.706561235     8989  C  WFS 468960767 + 16 [0]
8,160  1  55956  2.706562115     8989  C  WFS 468960767 [0]
8,160  1  55957  2.725928434  1436712  Q  FWFS 468960804 + 18 [fio]
8,160  1  55958  2.725929908  1436712  G  FWFS 468960804 + 18 [fio]
8,160  1  55959  2.725930980  1436712  I  FWFS 468960804 + 18 [fio]
8,160  7     12  2.706817448        0  C  WS 17954496 + 8 [0]
8,160  7     13  2.706892420        0  D  WFS 468960783 + 10 [swapper]
8,160  7     14  2.715675674        0  C  WS 3242200 + 8 [0]
8,160  7     15  2.715771745        0  D  WFS 468960793 + 11 [swapper]
8,160  7     16  2.736072549        0  C  WS 2058840 + 8 [0]
8,160  7     17  2.736167908        0  D  WFS 468960822 + 10 [swapper]
8,160  7     18  2.746284722        0  C  WS 20142728 + 8 [0]
8,160  7     19  2.746370687        0  D  WFS 468960832 + 11 [swapper]

Jan

> On 09 Jul 2015, at 15:50, Alexandre DERUMIER <[email protected]> wrote:
>
>>> Any ideas where to look? I was hoping blktrace would show what exactly is
>>> going on, but it just shows a synchronous write -> (10ms) -> completed
>
> which size is the write in this case ? 4K ? or more ?
>
>
> ----- Mail original -----
> De: "Jan Schermer" <[email protected]>
> À: "aderumier" <[email protected]>
> Cc: "ceph-users" <[email protected]>
> Envoyé: Jeudi 9 Juillet 2015 15:29:15
> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
>
> I tried everything: --write-barrier, --sync, --fsync, --fdatasync.
> I never get the same 10ms latency. Must be something special that the
> filesystem journal/log does.
>
> Any ideas where to look? I was hoping blktrace would show what exactly is
> going on, but it just shows a synchronous write -> (10ms) -> completed
>
> Jan
>
>> On 09 Jul 2015, at 15:26, Alexandre DERUMIER <[email protected]> wrote:
>>
>>>> I have 12K IOPS in this test on the block device itself. But only 100
>>>> filesystem transactions (=IOPS) on a filesystem on the same device,
>>>> because the “flush” (=FUA?) operation takes 10ms to finish. I just
>>>> can’t replicate the same “flush” operation with fio on the block
>>>> device, unfortunately, so I have no idea what is causing that :/
>>
>> AFAIK, fio on a block device with --sync=1 does a flush after each
>> write.
>>
>> I'm not sure about fio on a filesystem, but the filesystem should do an
>> fsync after each file write.
>>
>>
>> ----- Mail original -----
>> De: "Jan Schermer" <[email protected]>
>> À: "aderumier" <[email protected]>
>> Cc: "ceph-users" <[email protected]>
>> Envoyé: Jeudi 9 Juillet 2015 14:43:46
>> Objet: Re: [ceph-users] Investigating my 100 IOPS limit
>>
>> The old FUA code has been backported for quite some time. RHEL/CentOS 6.5
>> and higher have it for sure.
>>
>> I have 12K IOPS in this test on the block device itself. But only 100
>> filesystem transactions (=IOPS) on a filesystem on the same device,
>> because the “flush” (=FUA?) operation takes 10ms to finish. I just can’t
>> replicate the same “flush” operation with fio on the block device,
>> unfortunately, so I have no idea what is causing that :/
>>
>> Jan
>>
>>> On 09 Jul 2015, at 14:08, Alexandre DERUMIER <[email protected]> wrote:
>>>
>>> Hi,
>>> I have already seen bad performance with Crucial m550 ssds: 400 iops
>>> synchronous write.
>>>
>>> Not sure what model of ssd you have?
>>>
>>> see this:
>>> http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
>>>
>>> what is your result on the disk directly with
>>>
>>> # dd if=randfile of=/dev/sda bs=4k count=100000 oflag=direct,dsync
>>> # fio --filename=/dev/sda --direct=1 --sync=1 --rw=write --bs=4k
>>>   --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting
>>>   --name=journal-test
>>>
>>> ?
>>>
>>> I'm using lsi 3008 controllers with intel ssds (3500, 3610, 3700) in
>>> passthrough mode, and don't have any problem.
>>>
>>> also, about centos 2.6.32: I'm not sure FUA support has been backported
>>> by redhat (since true FUA support arrived in 2.6.37),
>>> so maybe it's the old barrier code.
>>>
>>> ----- Mail original -----
>>> De: "Jan Schermer" <[email protected]>
>>> À: "ceph-users" <[email protected]>
>>> Envoyé: Jeudi 9 Juillet 2015 12:32:04
>>> Objet: [ceph-users] Investigating my 100 IOPS limit
>>>
>>> I hope this will be interesting for some; it nearly cost me my sanity.
>>>
>>> Some time ago I came here with a problem manifesting as a “100 IOPS*”
>>> limit with the LSI controllers and some drives.
>>> It almost drove me crazy, as I could replicate the problem with ease, but
>>> when I wanted to show it to someone it was often gone. Sometimes it
>>> required fio to write for some time before the problem manifested again,
>>> or required seemingly conflicting settings to come up…
>>>
>>> Well, it turns out the problem is fio calling fallocate() when creating
>>> the file to use for the test, which doesn’t really allocate the blocks,
>>> it just “reserves” them.
>>> When fio writes to those blocks, the filesystem journal becomes the
>>> bottleneck (the 100 IOPS* limit can be seen there, at 100% utilization).
>>>
>>> If, however, I create the file with dd or the like, those writes do
>>> _not_ end up in the journal, and the result is 10K synchronous 4K IOPS
>>> on the same drive.
>>> If, for example, I run fio with a 1M block size, it still does 100* IOPS,
>>> and when I then run a 4K block size test without deleting the file, it
>>> runs at a 10K IOPS pace until it hits the first unwritten blocks - then
>>> it slows to a crawl again.
>>>
>>> The same issue is present with XFS and ext3/ext4 (with default mount
>>> options), and no matter how I create or mount the filesystem, I cannot
>>> avoid it. The only workaround is to mount ext4 with -o
>>> journal_async_commit, which should be safe, but...
>>>
>>> I am working on top of a CentOS 6.5 install (2.6.32 kernel) with LSI
>>> HBAs and Kingston SSDs in this case (interestingly, this issue does not
>>> seem to occur on Samsung SSDs!). I think it has something to do with LSI
>>> faking “FUA” support for the drives (AFAIK they don’t support it, so the
>>> controller must somehow flush the cache, which is what introduces the
>>> huge latency hit). I can’t replicate this problem on the block device
>>> itself, only on a file on a filesystem, so it might as well be a
>>> kernel/driver bug. I have a blktrace showing the difference between the
>>> “good” and “bad” writes, but I don’t know what the driver/controller
>>> does - I only see the write on the log device finishing after a long
>>> 10ms.
>>>
>>> Could someone tell me how Ceph creates the filesystem objects? I suppose
>>> it does fallocate() as well, right? Any way to force it to write them
>>> out completely instead, to get around this issue I have?
>>>
>>> How to replicate:
>>>
>>> fio --filename=/mnt/something/testfile.fio --sync=1 --rw=write --bs=4k
>>> --numjobs=1 --iodepth=1 --runtime=7200 --group_reporting
>>> --name=journal-test --size=1000M --ioengine=libaio
>>>
>>> * It is in fact 98 IOPS. Exactly. Not more, not less :-)
>>>
>>> Jan
>>> _______________________________________________
>>> ceph-users mailing list
>>> [email protected]
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
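One way to check the fallocate() explanation from the thread on your own machine is to compare the extent flags of a fallocate()d file against a pre-written one. This is a sketch, not something from the original thread: the mount point is a placeholder, and filefrag needs a FIEMAP-capable filesystem such as ext4 or XFS to show the flags.

```shell
MNT=/mnt/test   # placeholder: an ext4/XFS mount on the drive under test

# File created the way fio's preallocation does it: blocks are reserved
# but the extents are flagged "unwritten". The first synchronous write
# into each such extent must convert it, which goes through the journal.
fallocate -l 100M "$MNT/falloc.img"
filefrag -v "$MNT/falloc.img"       # extents should carry the "unwritten" flag

# Pre-writing the file instead (as with dd) leaves no unwritten extents,
# so a later fio --sync=1 run only pays for the data writes themselves.
dd if=/dev/zero of="$MNT/prewritten.img" bs=1M count=100 conv=fsync
filefrag -v "$MNT/prewritten.img"   # no "unwritten" flag here

# Then re-run the reproducer from the thread against the pre-written file:
# fio --filename=$MNT/prewritten.img --sync=1 --rw=write --bs=4k \
#     --numjobs=1 --iodepth=1 --runtime=60 --time_based \
#     --group_reporting --name=journal-test
```

If your fio build supports it, passing fallocate=none in the job should also skip the preallocation path entirely, so fio writes out the file itself during layout instead of reserving unwritten extents.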
