On 2015/12/16 9:39, Ryan Ding wrote:
> Hi Joseph,
>
> On 12/14/2015 06:36 PM, Joseph Qi wrote:
>> Hi Ryan,
>>
>> On 2015/12/14 13:31, Ryan Ding wrote:
>>> Hi Joseph,
>>>
>>> On 12/10/2015 06:36 PM, Joseph Qi wrote:
>>>> Hi Ryan,
>>>>
>>>> On 2015/12/10 16:48, Ryan Ding wrote:
>>>>> Hi Joseph,
>>>>>
>>>>> Thanks for your comments, please see my reply:
>>>>>
>>>>> On 12/10/2015 03:54 PM, Joseph Qi wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>> On 2015/10/12 14:34, Ryan Ding wrote:
>>>>>>> Hi Joseph,
>>>>>>>
>>>>>>> On 10/08/2015 02:13 PM, Joseph Qi wrote:
>>>>>>>> Hi Ryan,
>>>>>>>>
>>>>>>>> On 2015/10/8 11:12, Ryan Ding wrote:
>>>>>>>>> Hi Joseph,
>>>>>>>>>
>>>>>>>>> On 09/28/2015 06:20 PM, Joseph Qi wrote:
>>>>>>>>>> Hi Ryan,
>>>>>>>>>> I have gone through this patch set and done a simple performance
>>>>>>>>>> test using direct dd, and it indeed brings a big performance
>>>>>>>>>> improvement.
>>>>>>>>>>
>>>>>>>>>>           Before      After
>>>>>>>>>> bs=4K    1.4 MB/s    5.0 MB/s
>>>>>>>>>> bs=256K  40.5 MB/s   56.3 MB/s
>>>>>>>>>>
>>>>>>>>>> My questions are:
>>>>>>>>>> 1) Your solution is still using the orphan dir to keep inode and
>>>>>>>>>> allocation consistency, am I right? From our test, it is the most
>>>>>>>>>> complicated part and has many race cases to be taken into
>>>>>>>>>> consideration. So I wonder if this can be restructured.
>>>>>>>>> I have not got a better idea to do this. I think the only reason
>>>>>>>>> why direct io uses the orphan dir is to prevent space loss when
>>>>>>>>> the system crashes during an append direct write. But maybe a
>>>>>>>>> 'fsck -f' will do that job. Is it necessary to use the orphan dir?
>>>>>>>> The idea is taken from ext4, but since ocfs2 is a cluster
>>>>>>>> filesystem, it is much more complicated than ext4.
>>>>>>>> And fsck can only be used offline, while using the orphan dir
>>>>>>>> performs recovery online. So I don't think fsck can replace it in
>>>>>>>> all cases.
>>>>>>>>
>>>>>>>>>> 2) Rather than using normal block direct io, you introduce a way
>>>>>>>>>> to use write begin/end as in buffer io. IMO, if it wants to
>>>>>>>>>> perform like direct io, it should be committed to disk by forcing
>>>>>>>>>> a journal commit. But journal committing will consume much time.
>>>>>>>>>> Why does it bring a performance improvement instead?
>>>>>>>>> I use buffer io to write only the zero pages. The actual data
>>>>>>>>> payload is written as direct io. I think there is no need to do a
>>>>>>>>> force commit, because "direct" means "try to minimize cache
>>>>>>>>> effects of the I/O to and from this file"; it does not mean "write
>>>>>>>>> all data and metadata to disk before the write returns".
>>>>>> I think we cannot mix zero pages with direct io here, which will
>>>>>> lead to direct io data being overwritten by zero pages.
>>>>>> For example, take an ocfs2 volume with block size 4K and cluster
>>>>>> size 4K. Firstly I create a file with a size of 5K; it will be
>>>>>> allocated 2 clusters (8K), with the last 3K not zeroed (no need at
>>>>>> this time).
>>>>> I think the last 3K will be zeroed no matter whether you use direct
>>>>> io or buffer io to create the 5K file.
>>>>>> Then I seek to offset 9K and direct write 1K, then go back to 4K and
>>>>>> direct write 5K. Here we have to zero the allocated space to avoid
>>>>>> exposing dirty data. But since direct write data goes to disk
>>>>>> directly while zero pages depend on journal commit, the direct write
>>>>>> data will be overwritten and the file corrupts.
>>>>> do_blockdev_direct_IO() will zero the unwritten area within block
>>>>> size (in this case, 6K~8K) when the get_block callback returns a
>>>>> mapping with the buffer_new flag. This zero operation is also done
>>>>> via direct io.
>>>>> So the buffer io zero operation in my design will not work at all in
>>>>> this case. It only works to zero the area beyond block size, but
>>>>> within cluster size.
>>>>> For example, with block size 4KB and cluster size 1MB, a 4KB direct
>>>>> write will trigger a zero buffer page of size 1MB - 4KB = 1020KB.
>>>>> I think your question is that this zero buffer page will conflict
>>>>> with a later direct io write to the same area. In truth the conflict
>>>>> does not exist, because before a direct write, all conflicting buffer
>>>>> pages will be flushed to disk first (in __generic_file_write_iter()).
>>>> How can it make sure the zero pages are flushed to disk first? In
>>>> ocfs2_direct_IO, it calls ocfs2_dio_get_block, which uses write_begin
>>>> and write_end, and then __blockdev_direct_IO.
>>>> I've backported your patch set to kernel 3.0 and tested with vhd-util,
>>>> and the result fails. The test case is below.
>>>> 1) Create a 1G dynamic vhd file; the actual size is 5K.
>>>> # vhd-util create -n test.vhd -s 1024
>>>> 2) Resize it to 4G; the actual size becomes 11K.
>>>> # vhd-util resize -n test.vhd -s 4096 -j test.log
>>>> 3) hexdump the data, say hexdump1.
>>>> 4) umount to commit the journal and mount again, then hexdump the
>>>> data again, say hexdump2, which is not equal to hexdump1.
>>>> I am not sure if it is related to the kernel version, which indeed
>>>> has many differences due to refactoring.
>>> I have backported it to kernel 3.8 and run the script below (I think
>>> it's the same as your test):
>>>
>>> mount /dev/dm-1 /mnt
>>> pushd /mnt/
>>> rm test.vhd -f
>>> vhd-util create -n test.vhd -s 1024
>>> vhd-util resize -n test.vhd -s 4096 -j test.log
>>> hexdump test.vhd > ~/test.hex.1
>>> popd
>>> umount /mnt/
>>> mount /dev/dm-1 /mnt/
>>> hexdump /mnt/test.vhd > ~/test.hex.2
>>> umount /mnt
>>>
>>> Block size and cluster size are both 4K.
>>> It shows there is no difference between test.hex.1 and test.hex.2. I
>>> think this issue is related to the specific kernel version, so which
>>> version is your kernel? Please provide the backport patches if you
>>> wish :)
>> I am using kernel 3.0.93.
>> But I think it has no relation with the kernel version.
>> Within one direct io, if you use buffer pages to zero first and then do
>> the direct write, you cannot make sure of the order. In other words,
>> the direct io may go to disk first and then the zero buffers. That's
>> why I am using blkdev_issue_zeroout to do this in my patches.
>> And I am using jbd2_journal_force_commit to get metadata to disk at the
>> same time, which makes performance poorer than yours. It can be removed
>> if direct io's semantics do not require it.
> As you can see, generic_file_direct_write() (it's in ocfs2's direct io
> code path) will make sure buffer pages go first; there is no chance that
> direct io and buffer io write to the same place in parallel.
> And I have tested it with ltp's diotest & aiodio tests on both kernel
> 3.8 & 4.2, and no problem was found. I think you have something wrong
> with the backport; will you try your test with the newest -mm tree?

IMO, generic_file_direct_write can make sure the former buffered data go
first. But what I am talking about here is within the same direct io. In
other words, it is within the same mapping->a_ops->direct_IO call. And in
ocfs2_direct_IO, you call ocfs2_dio_get_block first, which will use zero
buffer pages, and then __blockdev_direct_IO. Here the order cannot be
guaranteed.
Thanks,
Joseph

>
> Thanks,
> Ryan
>>
>>> Thanks,
>>> Ryan
>>>> Thanks,
>>>> Joseph
>>>>
>>>>> BTW, there are a lot of test cases covering operations like buffer
>>>>> write, direct write, lseek... (a mix of these operations) in ltp
>>>>> (Linux Test Project). This patch set has passed all of them. :)
>>>>>>>> So this is protected by the "UNWRITTEN" flag, right?
>>>>>>>>
>>>>>>>>>> 3) Do you have a test in case of lack of memory?
>>>>>>>>> I tested it on a system with 2GB memory. Is that enough?
>>>>>>>> What I mean is doing many direct io jobs when system free memory
>>>>>>>> is low.
>>>>>>> I understand what you mean, but did not find a better way to test
>>>>>>> it. If free memory is too low, even the process cannot be started;
>>>>>>> if free memory is fairly enough, the test has no meaning.
>>>>>>> So I tried to collect the memory usage during io and did a
>>>>>>> comparison test with buffer io. The result is:
>>>>>>> 1. Start 100 dd to do 4KB direct writes:
>>>>>>> [root@hnode3 ~]# cat /proc/meminfo | grep -E "^Cached|^Dirty|^MemFree|^MemTotal|^Buffers|^Writeback:"
>>>>>>> MemTotal:       2809788 kB
>>>>>>> MemFree:          21824 kB
>>>>>>> Buffers:          55176 kB
>>>>>>> Cached:         2513968 kB
>>>>>>> Dirty:              412 kB
>>>>>>> Writeback:           36 kB
>>>>>>>
>>>>>>> 2. Start 100 dd to do 4KB buffer writes:
>>>>>>> [root@hnode3 ~]# cat /proc/meminfo | grep -E "^Cached|^Dirty|^MemFree|^MemTotal|^Buffers|^Writeback:"
>>>>>>> MemTotal:       2809788 kB
>>>>>>> MemFree:          22476 kB
>>>>>>> Buffers:          15696 kB
>>>>>>> Cached:         2544892 kB
>>>>>>> Dirty:           320136 kB
>>>>>>> Writeback:       146404 kB
>>>>>>>
>>>>>>> You can see from the 'Dirty' and 'Writeback' fields that direct io
>>>>>>> does not use nearly as much memory as buffer io. So I think the
>>>>>>> concern you raised no longer exists. :-)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>> Thanks,
>>>>>>>> Joseph
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ryan
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel