On 2015/12/16 9:39, Ryan Ding wrote:
> Hi Joseph,
>
> On 12/14/2015 06:36 PM, Joseph Qi wrote:
>> Hi Ryan,
>>
>> On 2015/12/14 13:31, Ryan Ding wrote:
>>> Hi Joseph,
>>>
>>> On 12/10/2015 06:36 PM, Joseph Qi wrote:
>>>> Hi Ryan,
>>>>
>>>> On 2015/12/10 16:48, Ryan Ding wrote:
>>>>> Hi Joseph,
>>>>>
>>>>> Thanks for your comments, please see my reply:
>>>>>
>>>>> On 12/10/2015 03:54 PM, Joseph Qi wrote:
>>>>>> Hi Ryan,
>>>>>>
>>>>>> On 2015/10/12 14:34, Ryan Ding wrote:
>>>>>>> Hi Joseph,
>>>>>>>
>>>>>>> On 10/08/2015 02:13 PM, Joseph Qi wrote:
>>>>>>>> Hi Ryan,
>>>>>>>>
>>>>>>>> On 2015/10/8 11:12, Ryan Ding wrote:
>>>>>>>>> Hi Joseph,
>>>>>>>>>
>>>>>>>>> On 09/28/2015 06:20 PM, Joseph Qi wrote:
>>>>>>>>>> Hi Ryan,
>>>>>>>>>> I have gone through this patch set and done a simple performance
>>>>>>>>>> test using direct dd, and it indeed brings a big performance
>>>>>>>>>> improvement.
>>>>>>>>>>
>>>>>>>>>>           Before      After
>>>>>>>>>> bs=4K    1.4 MB/s    5.0 MB/s
>>>>>>>>>> bs=256K  40.5 MB/s   56.3 MB/s
>>>>>>>>>>
>>>>>>>>>> My questions are:
>>>>>>>>>> 1) Your solution is still using the orphan dir to keep inode and
>>>>>>>>>> allocation consistency, am I right? From our test, it is the most
>>>>>>>>>> complicated part and has many race cases to be taken into
>>>>>>>>>> consideration. So I wonder if this can be restructured.
>>>>>>>>> I have not got a better idea to do this. I think the only reason
>>>>>>>>> why direct io uses the orphan dir is to prevent space loss when
>>>>>>>>> the system crashes during an append direct write. But maybe a
>>>>>>>>> 'fsck -f' will do that job. Is it necessary to use the orphan dir?
>>>>>>>> The idea is taken from ext4, but since ocfs2 is a cluster
>>>>>>>> filesystem, it is much more complicated than ext4.
>>>>>>>> And fsck can only be used offline, while using the orphan dir
>>>>>>>> performs recovery online. So I don't think fsck can replace it in
>>>>>>>> all cases.
>>>>>>>>
>>>>>>>>>> 2) Rather than using normal block direct io, you introduce a way
>>>>>>>>>> to use write begin/end as in buffer io. IMO, if it wants to
>>>>>>>>>> perform like direct io, it should be committed to disk by forcing
>>>>>>>>>> a journal commit. But journal committing will consume much time.
>>>>>>>>>> Why does it bring a performance improvement instead?
>>>>>>>>> I use buffer io to write only the zero pages. The actual data
>>>>>>>>> payload is written as direct io. I think there is no need to do a
>>>>>>>>> force commit, because "direct" means "try to minimize cache
>>>>>>>>> effects of the I/O to and from this file"; it does not mean "write
>>>>>>>>> all data and metadata to disk before the write returns".
>>>>>> I think we cannot mix zero pages with direct io here, which will
>>>>>> lead to direct io data being overwritten by zero pages.
>>>>>> For example, take an ocfs2 volume with block size 4K and cluster
>>>>>> size 4K. Firstly I create a file with a size of 5K; it will be
>>>>>> allocated 2 clusters (8K), with the last 3K not zeroed (no need at
>>>>>> this time).
>>>>> I think the last 3K will be zeroed no matter whether you use direct
>>>>> io or buffer io to create the 5K file.
>>>>>> Then I seek to offset 9K and direct write 1K, then go back to 4K and
>>>>>> direct write 5K. Here we have to zero the allocated space to avoid
>>>>>> exposing dirty data. But since direct write data goes to disk
>>>>>> directly while zero pages depend on journal commit, the direct write
>>>>>> data will be overwritten and the file corrupts.
>>>>> do_blockdev_direct_IO() will zero the unwritten area within block
>>>>> size (in this case, 6K~8K) when the get_block callback returns a
>>>>> mapping with the buffer_new flag. This zero operation is also done
>>>>> via direct io.
>>>>> So the buffer io zero operation in my design will not work at all in
>>>>> this case. It only works to zero the area beyond block size, but
>>>>> within cluster size.
>>>>> For example, with block size 4KB and cluster size 1MB, a 4KB direct
>>>>> write will trigger a zero buffer page of size 1MB - 4KB = 1020KB.
>>>>> I think your question is that this zero buffer page will conflict
>>>>> with a later direct io write to the same area. In truth the conflict
>>>>> does not exist, because before a direct write, all conflicting buffer
>>>>> pages will be flushed to disk first (in __generic_file_write_iter()).
>>>> How can it make sure the zero pages are flushed to disk first? In
>>>> ocfs2_direct_IO, it calls ocfs2_dio_get_block, which uses write_begin
>>>> and write_end, and then __blockdev_direct_IO.
>>>> I've backported your patch set to kernel 3.0 and tested with vhd-util,
>>>> and the result fails. The test case is below.
>>>> 1) Create a 1G dynamic vhd file; the actual size is 5K.
>>>> # vhd-util create -n test.vhd -s 1024
>>>> 2) Resize it to 4G; the actual size becomes 11K.
>>>> # vhd-util resize -n test.vhd -s 4096 -j test.log
>>>> 3) hexdump the data, say hexdump1.
>>>> 4) umount to commit the journal and mount again, then hexdump the
>>>> data again, say hexdump2, which is not equal to hexdump1.
>>>> I am not sure if it is related to the kernel version, which indeed
>>>> has many differences due to refactoring.
>>> I have backported it to kernel 3.8 and run the script below (I think
>>> it's the same as your test):
>>>
>>> mount /dev/dm-1 /mnt
>>> pushd /mnt/
>>> rm test.vhd -f
>>> vhd-util create -n test.vhd -s 1024
>>> vhd-util resize -n test.vhd -s 4096 -j test.log
>>> hexdump test.vhd > ~/test.hex.1
>>> popd
>>> umount /mnt/
>>> mount /dev/dm-1 /mnt/
>>> hexdump /mnt/test.vhd > ~/test.hex.2
>>> umount /mnt
>>>
>>> Block size and cluster size are both 4K.
>>> It shows there is no difference between test.hex.1 and test.hex.2. I
>>> think this issue is related to the specific kernel version, so which
>>> version is your kernel? Please provide the backport patches if you
>>> wish :)
>> I am using kernel 3.0.93.
>> But I think it has no relation with the kernel version.
>> Within one direct io, if you use buffer pages to zero first and then do
>> the direct write, you cannot make sure of the order. In other words,
>> the direct io may go to disk first and then the zero buffers. That's
>> why I am using blkdev_issue_zeroout to do this in my patches.
>> And I am using jbd2_journal_force_commit to get metadata to disk at the
>> same time, which makes performance poorer than yours. It can be removed
>> if direct io's semantics do not require it.
> As you can see, generic_file_direct_write() (it's in ocfs2's direct io
> code path) will make sure buffer pages go first; there is no chance that
> direct io and buffer io write to the same place in parallel.
> And I have tested it with ltp's diotest & aiodio tests on both kernel
> 3.8 & 4.2, and no problem was found. I think you have something wrong
> with the backport; will you try your test with the newest -mm tree?

IMO, generic_file_direct_write can make sure the former buffered data go
first. But what I am talking about here is within the same direct io. In
other words, it is within the same mapping->a_ops->direct_IO call. And in
ocfs2_direct_IO, you call ocfs2_dio_get_block first, which will use zero
buffer pages, and then __blockdev_direct_IO. Here the order cannot be
guaranteed.
Thanks,
Joseph

>
> Thanks,
> Ryan
>>
>>> Thanks,
>>> Ryan
>>>> Thanks,
>>>> Joseph
>>>>
>>>>> BTW, there are a lot of test cases covering operations like buffer
>>>>> write, direct write, lseek... (a mix of these operations) in ltp
>>>>> (Linux Test Project). This patch set has passed all of them. :)
>>>>>>>> So this is protected by the "UNWRITTEN" flag, right?
>>>>>>>>
>>>>>>>>>> 3) Do you have a test in case of lack of memory?
>>>>>>>>> I tested it on a system with 2GB memory. Is that enough?
>>>>>>>> What I mean is doing many direct io jobs when system free memory
>>>>>>>> is low.
>>>>>>> I understand what you mean, but did not find a better way to test
>>>>>>> it. If free memory is too low, even the process cannot be started;
>>>>>>> if free memory is fairly enough, the test has no meaning.
>>>>>>> So I tried to collect the memory usage during io and did a
>>>>>>> comparison test with buffer io. The result is:
>>>>>>> 1. Start 100 dd to do 4KB direct writes:
>>>>>>> [root@hnode3 ~]# cat /proc/meminfo | grep -E "^Cached|^Dirty|^MemFree|^MemTotal|^Buffers|^Writeback:"
>>>>>>> MemTotal:       2809788 kB
>>>>>>> MemFree:          21824 kB
>>>>>>> Buffers:          55176 kB
>>>>>>> Cached:         2513968 kB
>>>>>>> Dirty:              412 kB
>>>>>>> Writeback:           36 kB
>>>>>>>
>>>>>>> 2. Start 100 dd to do 4KB buffer writes:
>>>>>>> [root@hnode3 ~]# cat /proc/meminfo | grep -E "^Cached|^Dirty|^MemFree|^MemTotal|^Buffers|^Writeback:"
>>>>>>> MemTotal:       2809788 kB
>>>>>>> MemFree:          22476 kB
>>>>>>> Buffers:          15696 kB
>>>>>>> Cached:         2544892 kB
>>>>>>> Dirty:           320136 kB
>>>>>>> Writeback:       146404 kB
>>>>>>>
>>>>>>> You can see from the 'Dirty' and 'Writeback' fields that direct io
>>>>>>> does not use nearly as much memory as buffer io. So I think the
>>>>>>> concern you raised no longer exists. :-)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ryan
>>>>>>>> Thanks,
>>>>>>>> Joseph
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Ryan
>
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel@oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel