Hi Ryan,

On 2015/12/10 16:48, Ryan Ding wrote:
> Hi Joseph,
>
> Thanks for your comments, please see my reply:
>
> On 12/10/2015 03:54 PM, Joseph Qi wrote:
>> Hi Ryan,
>>
>> On 2015/10/12 14:34, Ryan Ding wrote:
>>> Hi Joseph,
>>>
>>> On 10/08/2015 02:13 PM, Joseph Qi wrote:
>>>> Hi Ryan,
>>>>
>>>> On 2015/10/8 11:12, Ryan Ding wrote:
>>>>> Hi Joseph,
>>>>>
>>>>> On 09/28/2015 06:20 PM, Joseph Qi wrote:
>>>>>> Hi Ryan,
>>>>>> I have gone through this patch set and done a simple performance test
>>>>>> using direct dd, and it indeed brings a significant performance
>>>>>> improvement.
>>>>>>              Before       After
>>>>>> bs=4K        1.4 MB/s     5.0 MB/s
>>>>>> bs=256k      40.5 MB/s    56.3 MB/s
>>>>>>
>>>>>> My questions are:
>>>>>> 1) Your solution is still using the orphan dir to keep inode and
>>>>>> allocation consistency, am I right? From our test, it is the most
>>>>>> complicated part and has many race cases to be taken into
>>>>>> consideration. So I wonder if this can be restructured.
>>>>> I have not got a better idea to do this. I think the only reason why
>>>>> direct io uses the orphan dir is to prevent space loss when the system
>>>>> crashes during an append direct write. But maybe a 'fsck -f' will do
>>>>> that job. Is it necessary to use the orphan dir?
>>>> The idea is taken from ext4, but since ocfs2 is a cluster filesystem,
>>>> it is much more complicated than ext4.
>>>> And fsck can only be used offline, while using the orphan dir performs
>>>> recovery online. So I don't think fsck can replace it in all cases.
>>>>
>>>>>> 2) Rather than using normal block direct io, you introduce a way to
>>>>>> use write begin/end as in buffer io. IMO, if it wants to perform like
>>>>>> direct io, it should be committed to disk by forcing a journal
>>>>>> commit. But journal committing consumes much time. Why does it bring
>>>>>> a performance improvement instead?
>>>>> I use buffer io to write only the zero pages. The actual data payload
>>>>> is written as direct io. I think there is no need to do a force commit.
>>>>> Because direct means "try to minimize cache effects of the I/O to and
>>>>> from this file"; it does not mean "write all data & metadata to disk
>>>>> before the write returns".
>> I think we cannot mix zero pages with direct io here, which will lead
>> to direct io data being overwritten by zero pages.
>> For example, take an ocfs2 volume with block size 4K and cluster size 4K.
>> First I create a file with a size of 5K; it will be allocated 2 clusters
>> (8K), with the last 3K not zeroed (no need at this time).
> I think the last 3K will be zeroed no matter whether you use direct io or
> buffer io to create the 5K file.
>> Then I seek to offset 9K and do a direct write of 1K, then go back to 4K
>> and do a direct write of 5K. Here we have to zero the allocated space to
>> avoid dirty data. But since direct write data goes to disk directly while
>> zero pages depend on journal commit, the direct write data will be
>> overwritten and the file corrupts.
> do_blockdev_direct_IO() will zero the unwritten area within block size (in
> this case, 6K~8K) when the get_block callback returns a mapping with the
> buffer_new flag. This zero operation also uses direct io.
> So the buffer io zero operation in my design will not work at all in this
> case. It only works to zero the area beyond block size but within cluster
> size. For example, with block size 4KB and cluster size 1MB, a 4KB direct
> write will trigger a zero buffer page of size 1MB-4KB=1020KB.
> I think your question is that this zero buffer page will conflict with a
> later direct io write to the same area. In truth, the conflict will not
> exist, because before a direct write all conflicting buffer pages will be
> flushed to disk first (in __generic_file_write_iter()).
How can it make sure the zero pages are flushed to disk first? In
ocfs2_direct_IO, it calls ocfs2_dio_get_block, which uses write_begin and
write_end, and then __blockdev_direct_IO.
I've backported your patch set to kernel 3.0 and tested with vhd-util, and
the result fails.
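The split described above (do_blockdev_direct_IO() zeroes the unwritten tail within the last block; the patch's buffer pages zero the rest of the cluster) can be sketched with a small arithmetic model. This is purely illustrative Python, not kernel code; `zero_ranges` is a made-up helper and the numbers simply re-derive the examples in the discussion.

```python
# Toy model of which bytes get zeroed for a direct write that ends
# partway into a freshly allocated cluster:
#   - do_blockdev_direct_IO() zeroes up to the end of the last block
#   - the buffer-page path zeroes from there to the end of the cluster

def zero_ranges(write_off, write_len, block_size, cluster_size):
    """Return (dio_zero_bytes, buffer_zero_bytes) for the write's tail."""
    end = write_off + write_len
    block_end = -(-end // block_size) * block_size        # round up to block
    cluster_end = -(-end // cluster_size) * cluster_size  # round up to cluster
    dio_zero = block_end - end           # zeroed by the direct-io path
    buffer_zero = cluster_end - block_end  # zeroed via buffer pages
    return dio_zero, buffer_zero

# 4KB block, 1MB cluster: a 4KB-aligned direct write leaves
# 1MB - 4KB = 1020KB to be zeroed through buffer pages.
print(zero_ranges(0, 4096, 4096, 1024 * 1024))   # (0, 1044480)

# 4KB block, 4KB cluster, 1K write at offset 9K: block size equals
# cluster size, so the direct-io path does all the zeroing itself
# and the buffer-page zeroing never applies.
print(zero_ranges(9 * 1024, 1024, 4096, 4096))   # (2048, 0)
```

The second case matches Ryan's point that, with block size equal to cluster size, the buffer io zero operation plays no role in Joseph's example.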
The test case is below.
1) Create a 1G dynamic vhd file; the actual size is 5K.
   # vhd-util create -n test.vhd -s 1024
2) Resize it to 4G; the actual size becomes 11K.
   # vhd-util resize -n test.vhd -s 4096 -j test.log
3) Hexdump the data, say hexdump1.
4) Umount to commit the journal, mount again, and hexdump the data again,
   say hexdump2, which is not equal to hexdump1.
I am not sure if there is any relation with the kernel version, which indeed
has many differences due to refactoring.
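The suspected failure mode (hexdump changing after the umount/mount cycle) can be sketched as an ordering problem: direct-io data reaches disk immediately, while queued zero pages only reach disk when the journal commits. A toy model, with entirely made-up function names (none of these are kernel APIs):

```python
# Illustrative race model: if a zero page queued behind the journal is
# flushed *after* a later direct write to the same range, the commit
# clobbers real data -- so the on-disk contents change at umount.

disk = {}      # offset -> bytes actually on disk
pending = {}   # zero pages waiting for the journal commit

def direct_write(off, data):
    disk[off] = data                    # goes straight to disk

def queue_zero_page(off, length):
    pending[off] = b"\x00" * length     # deferred until commit

def commit_journal():                   # modelled here as the umount step
    for off, zeros in pending.items():
        disk[off] = zeros               # late flush overwrites newer data
    pending.clear()

# A zero page for a freshly allocated cluster is queued first...
queue_zero_page(8192, 4096)
# ...then a direct write lands in the same cluster before the commit.
direct_write(8192, b"DATA")
commit_journal()
print(disk[8192][:4])   # b'\x00\x00\x00\x00' -- the direct data is lost
```

This is only a model of the hypothesis being tested here; whether the backport to kernel 3.0 actually exhibits this ordering is exactly what the vhd-util test is probing.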
Thanks,
Joseph

> BTW, there are a lot of testcases that test operations like buffer write,
> direct write, lseek... (a mix of these operations) in ltp (Linux Test
> Project). This patch set has passed all of them. :)
>>
>>>> So this is protected by the "UNWRITTEN" flag, right?
>>>>
>>>>>> 3) Do you have a test in case of lack of memory?
>>>>> I tested it in a system with 2GB memory. Is that enough?
>>>> What I mean is doing many direct io jobs when system free memory is
>>>> low.
>>> I understand what you mean, but did not find a better way to test it.
>>> If free memory is too low, even the process cannot be started; if free
>>> memory is fairly enough, the test has no meaning.
>>> So I tried to collect the memory usage during io, and did a comparison
>>> test with buffer io. The result is:
>>> 1. start 100 dd to do 4KB direct write:
>>> [root@hnode3 ~]# cat /proc/meminfo | grep -E
>>> "^Cached|^Dirty|^MemFree|^MemTotal|^Buffers|^Writeback:"
>>> MemTotal:        2809788 kB
>>> MemFree:           21824 kB
>>> Buffers:           55176 kB
>>> Cached:          2513968 kB
>>> Dirty:               412 kB
>>> Writeback:            36 kB
>>>
>>> 2. start 100 dd to do 4KB buffer write:
>>> [root@hnode3 ~]# cat /proc/meminfo | grep -E
>>> "^Cached|^Dirty|^MemFree|^MemTotal|^Buffers|^Writeback:"
>>> MemTotal:        2809788 kB
>>> MemFree:           22476 kB
>>> Buffers:           15696 kB
>>> Cached:          2544892 kB
>>> Dirty:            320136 kB
>>> Writeback:        146404 kB
>>>
>>> You can see from the 'Dirty' and 'Writeback' fields that direct io does
>>> not use nearly as much memory as buffer io. So I think the issue you are
>>> concerned about no longer exists. :-)
>>>
>>> Thanks,
>>> Ryan
>>>> Thanks,
>>>> Joseph
>>>>
>>>>> Thanks,
>>>>> Ryan

_______________________________________________
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel