On Tue, Jan 11, 2011 at 11:27:33AM +0800, Li, Shaohua wrote:
> On Tue, 2011-01-11 at 11:07 +0800, Wu, Fengguang wrote:
> > On Tue, Jan 11, 2011 at 10:03:16AM +0800, Li, Shaohua wrote:
> > > On Tue, 2011-01-11 at 09:38 +0800, Wu, Fengguang wrote:
> > > > On Tue, Jan 11, 2011 at 08:15:19AM +0800, Li, Shaohua wrote:
> > > > > On Mon, 2011-01-10 at 22:26 +0800, Wu, Fengguang wrote:
> > > > > > Shaohua,
> > > > > >
> > > > > > On Tue, Jan 04, 2011 at 01:40:30PM +0800, Li, Shaohua wrote:
> > > > > > > Hi,
> > > > > > > We have file readahead to do async file reads, but no metadata
> > > > > > > readahead. For a list of files, their metadata is stored in
> > > > > > > fragmented disk space, and metadata reads are synchronous, which
> > > > > > > badly hurts the efficiency of readahead. These patches add
> > > > > > > metadata readahead for btrfs.
> > > > > > > In btrfs, metadata is stored in btree_inode. Ideally, if we could
> > > > > > > hook the inode up to an fd, we could use existing syscalls
> > > > > > > (readahead, mincore or the upcoming fincore) to do readahead, but
> > > > > > > the inode is hidden and there is no easy way to do that as far as
> > > > > > > I understand. So we add two ioctls for
> > > > > >
> > > > > > If that is the main obstacle, why not do straightforward fincore()/
> > > > > > fadvise(), and add ioctls to btrfs to export/grab the hidden
> > > > > > btree_inode in any form? This would address btrfs' specific issue,
> > > > > > and have the benefit of making the VFS part general enough. You
> > > > > > know ext2/3/4 already have the block_dev inode ready for metadata
> > > > > > readahead.
> > > > > I forgot to update this comment. Please see patch 2 and patch 4: both
> > > > > incore and readahead need btrfs-specific stuff involved, so we can't
> > > > > use generic fincore or the like.
> > > > You can if you like :)
> > > >
> > > > - fincore() can return the referenced bit, which is generally
> > > >   useful information
> > > Metadata pages in ext2/3 don't have the referenced bit set, while
> > > btrfs's do. We can't blindly filter out such pages with the bit.
> >
> > block_dev inodes have the accessed bits. Look at the output below.
> > /dev/sda5 is a mounted ext4 partition. The 'A'/'R' in the
> > dump_page_cache lines stand for Active/Referenced.
> ext4 already does readahead? Please check other filesystems.
ext3/4 do readahead when accessing large directories. However, that is an
orthogonal feature to user-space metadata readahead; the latter is still
important for fast boot on ext3/4.

> filesystems use bread-like APIs to read metadata, which definitely
> don't set the referenced bit.

__find_get_block() will call touch_buffer(), which is a synonym for
mark_page_accessed().

> > r...@bay /home/wfg# echo /dev/sda5 > /debug/tracing/objects/mm/pages/dump-file
> > r...@bay /home/wfg# cat /debug/tracing/trace
> > # tracer: nop
> > #
> > #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
> > #              | |       |          |         |
> >     zsh-2950 [003] 879.500764: dump_inode_cache: 0 55643986944 1703936 21879 D___ BLK mount /dev/sda5
> >     zsh-2950 [003] 879.500774: dump_page_cache: 0 2 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500776: dump_page_cache: 2 3 ____R_____P 2 0
> >     zsh-2950 [003] 879.500777: dump_page_cache: 1026 5 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500778: dump_page_cache: 1031 3 ___A______P 2 0
> >     zsh-2950 [003] 879.500779: dump_page_cache: 1034 1 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500780: dump_page_cache: 1035 2 ___A______P 2 0
> >     zsh-2950 [003] 879.500781: dump_page_cache: 1037 1 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500782: dump_page_cache: 1038 3 ____R_____P 2 0
> >     zsh-2950 [003] 879.500782: dump_page_cache: 1041 1 ___A______P 2 0
> >     zsh-2950 [003] 879.500783: dump_page_cache: 1057 1 ___AR_D___P 2 0
> >     zsh-2950 [003] 879.500788: dump_page_cache: 1058 6 ___A______P 2 0
> >     zsh-2950 [003] 879.500788: dump_page_cache: 9249 1 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500789: dump_page_cache: 524289 1 ____R_____P 2 0
> >     zsh-2950 [003] 879.500790: dump_page_cache: 524290 2 ___A______P 2 0
> >     zsh-2950 [003] 879.500790: dump_page_cache: 524292 1 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500791: dump_page_cache: 524293 1 ___A______P 2 0
> >     zsh-2950 [003] 879.500796: dump_page_cache: 524294 9 ____R_____P 2 0
> >     zsh-2950 [003] 879.500797: dump_page_cache: 524303 1 ___A______P 2 0
> >     zsh-2950 [003] 879.500798: dump_page_cache: 987136 1 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500798: dump_page_cache: 1048576 1 ____R_____P 2 0
> >     zsh-2950 [003] 879.500799: dump_page_cache: 1048577 2 ___A______P 2 0
> >     zsh-2950 [003] 879.500800: dump_page_cache: 1048579 1 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500801: dump_page_cache: 1048580 5 ___A______P 2 0
> >     zsh-2950 [003] 879.500802: dump_page_cache: 1048585 1 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500805: dump_page_cache: 1048586 5 ___A______P 2 0
> >     zsh-2950 [003] 879.500805: dump_page_cache: 1048591 1 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500806: dump_page_cache: 1572864 1 ____R_____P 2 0
> >     zsh-2950 [003] 879.500807: dump_page_cache: 1572865 5 ___A______P 2 0
> >     zsh-2950 [003] 879.500808: dump_page_cache: 1572870 1 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500811: dump_page_cache: 1572871 6 ___A______P 2 0
> >     zsh-2950 [003] 879.500812: dump_page_cache: 1572877 3 ____R_____P 2 0
> >     zsh-2950 [003] 879.500816: dump_page_cache: 2097153 8 ____R_____P 2 0
> >     zsh-2950 [003] 879.500817: dump_page_cache: 2097161 1 ___A______P 2 0
> >     zsh-2950 [003] 879.500818: dump_page_cache: 2097162 4 ____R_____P 2 0
> >     zsh-2950 [003] 879.500819: dump_page_cache: 6324224 1 ____R_D___P 2 0
> >     zsh-2950 [003] 879.500820: dump_page_cache: 6324225 3 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500825: dump_page_cache: 6324228 29 ___A______P 2 0
> >     zsh-2950 [003] 879.500826: dump_page_cache: 6324257 1 ____R_____P 2 0
> >     zsh-2950 [003] 879.500828: dump_page_cache: 6324258 4 ___A______P 2 0
> >     zsh-2950 [003] 879.500830: dump_page_cache: 6324262 11 ____R_____P 2 0
> >     zsh-2950 [003] 879.500833: dump_page_cache: 6324273 16 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500833: dump_page_cache: 6324289 1 ___A______P 2 0
> >     zsh-2950 [003] 879.500834: dump_page_cache: 6324290 2 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500835: dump_page_cache: 6324292 8 ___A______P 2 0
> >     zsh-2950 [003] 879.500836: dump_page_cache: 6324300 2 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500837: dump_page_cache: 6324302 3 ___A______P 2 0
> >     zsh-2950 [003] 879.500838: dump_page_cache: 6324305 4 ____R_____P 2 0
> >     zsh-2950 [003] 879.500843: dump_page_cache: 6324309 28 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500844: dump_page_cache: 6324337 4 ___A______P 2 0
> >     zsh-2950 [003] 879.500845: dump_page_cache: 6324341 2 ____R_____P 2 0
> >     zsh-2950 [003] 879.500850: dump_page_cache: 6324343 30 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500851: dump_page_cache: 6324373 2 ___A______P 2 0
> >     zsh-2950 [003] 879.500852: dump_page_cache: 6324375 2 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500853: dump_page_cache: 6324377 9 ___A______P 2 0
> >     zsh-2950 [003] 879.500854: dump_page_cache: 6324386 2 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500855: dump_page_cache: 6324388 5 ___A______P 2 0
> >     zsh-2950 [003] 879.500856: dump_page_cache: 6324393 3 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500858: dump_page_cache: 6324396 11 ___A______P 2 0
> >     zsh-2950 [003] 879.500859: dump_page_cache: 6324407 1 ____R_____P 2 0
> >     zsh-2950 [003] 879.500864: dump_page_cache: 6324408 31 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500864: dump_page_cache: 6324439 1 ___A______P 2 0
> >     zsh-2950 [003] 879.500865: dump_page_cache: 6324440 1 ____R_____P 2 0
> >     zsh-2950 [003] 879.500866: dump_page_cache: 6324441 2 ___A______P 2 0
> >     zsh-2950 [003] 879.500867: dump_page_cache: 6324443 5 ____R_____P 2 0
> >     zsh-2950 [003] 879.500872: dump_page_cache: 6324448 26 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500873: dump_page_cache: 6324474 6 ___A______P 2 0
> >     zsh-2950 [003] 879.500874: dump_page_cache: 6324480 4 ____R_____P 2 0
> >     zsh-2950 [003] 879.500879: dump_page_cache: 6324484 28 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500880: dump_page_cache: 6324512 4 ___A______P 2 0
> >     zsh-2950 [003] 879.500881: dump_page_cache: 6324516 1 ____R_____P 2 0
> >     zsh-2950 [003] 879.500881: dump_page_cache: 6324517 1 ___A______P 2 0
> >     zsh-2950 [003] 879.500882: dump_page_cache: 6324518 2 ___AR_____P 2 0
> >     zsh-2950 [003] 879.500888: dump_page_cache: 6324520 28 ___A______P 2 0
> >     zsh-2950 [003] 879.500890: dump_page_cache: 6324548 2 ____R_____P 2 0

> > > fincore can take a parameter or return a bit to distinguish
> > > referenced pages, but I don't think it's a good API. This should be
> > > transparent to userspace.
> >
> > Users who care about the "cached" status may well be interested in the
> > "active/referenced" status; they are correlated pieces of information.
> > fincore() won't be a simple replication of mincore() anyway: fincore()
> > has to deal with huge, sparsely accessed files. The accessed bits of a
> > file page are normally more meaningful than the accessed bits of mapped
> > (anonymous) pages.
> If all filesystems have the bit set, I'll buy in. Otherwise, this isn't
> generic enough.

It's a reasonable thing to set the accessed bits. So I believe the various
filesystems are already calling mark_page_accessed() on their metadata
inodes, or can be changed to do so.

> > Another option may be to use the above
> > /debug/tracing/objects/mm/pages/dump-file interface.
> >
> > > > - btrfs_metadata_readahead() can be passed to some (faked)
> > > >   ->readpages() for use with fadvise.
> > > This needs a filesystem-specific hook too; the difference is that your
> > > proposal uses fadvise while I'm using an ioctl. There isn't a big
> > > difference.
> >
> > True for btrfs. However it makes a big difference for other filesystems.
> why?

The block_dev inode of ext2/3/4 can do metadata query/readahead directly
with fincore()+fadvise(), with no need for any additional ioctls. Given
that the vast majority of desktops are running ext2/3/4, it seems
worthwhile to have a straightforward solution for them.
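The fincore()+fadvise() flow argued for above could look roughly like this
from user space (an illustrative sketch, not code from the patch set; it
assumes a Linux host, and uses a throwaway regular file to stand in for a
block device such as /dev/sda5, which would need root to open):

```python
import os
import tempfile

def prefetch_extents(path, extents):
    """Ask the kernel to read (offset, length) extents asynchronously.

    POSIX_FADV_WILLNEED only *initiates* readahead and returns without
    waiting for the I/O, which is the same semantic the proposed
    metadata-readahead interface provides."""
    fd = os.open(path, os.O_RDONLY)
    try:
        for offset, length in extents:
            os.posix_fadvise(fd, offset, length, os.POSIX_FADV_WILLNEED)
    finally:
        os.close(fd)

# Demo on a 64 KiB scratch file standing in for the block device.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * 65536)
    scratch = f.name
prefetch_extents(scratch, [(0, 4096), (32768, 8192)])
os.remove(scratch)
```

The extent list here is made up; in the boot-time scenario it would come
from a prior fincore()-style scan of which metadata pages were referenced.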
> > > BTW, it's hard to hook btrfs_inode up to an fd even with an ioctl; at
> > > least I didn't find an easy way to do it. It might be possible, for
> > > example by adding a fake device or fake fs (anon_inode doesn't work
> > > here, IIRC), but that is a bit ugly. Before it's proved that a generic
> > > API can handle metadata readahead, I don't want to do it.
> >
> > Right, it could be hard to export btrfs_inode. I'm glad you spelled it
> > out. If we cannot make it work, it's valuable to point out the problem
> > and let everyone know the root cause why we turned to an ioctl-based
> > workaround. Then others will understand the design choices, and if
> > lucky, join us and help export the btrfs_inode.
> I didn't hide anything. I actually pointed this out in the comments.
> This is what I said.

Ah, sorry for overlooking this message!

Thanks,
Fengguang

> > In btrfs, metadata is stored in btree_inode. Ideally, if we could hook
> > the inode to a fd so we could use existing syscalls (readahead, mincore
> > or upcoming fincore) to do readahead, but the inode is hidden, there is
> > no easy way for this from my understanding.
>
> Thanks,
> Shaohua

> > > > > > > this. One is like the readahead syscall, the other is like the
> > > > > > > mincore/fincore syscall.
> > > > > > > On a hard-disk-based netbook with MeeGo, the metadata readahead
> > > > > > > reduced boot time by about 3.5s on average, out of 16s total.
> > > > > > > Last time I posted similar patches to the btrfs mailing list,
> > > > > > > adding the new ioctls in btrfs-specific ioctl code. But
> > > > > > > Christoph Hellwig asked for a generic interface so other
> > > > > > > filesystems can share some code, so I came up with this new
> > > > > > > one. Comments and suggestions are welcome!
> > > > > > >
> > > > > > > v1->v2:
> > > > > > > 1. Added more comments and fixed return values, as suggested by
> > > > > > >    Andrew Morton
> > > > > > > 2. Fixed a race condition pointed out by Yan Zheng
> > > > > > >
> > > > > > > initial post:
> > > > > > > http://marc.info/?l=linux-fsdevel&m=129222493406353&w=2
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Shaohua
> > > > > > >
> > > > > > > --
> > > > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > > > linux-fsdevel" in the body of a message to
> > > > > > > [email protected]
> > > > > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
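For reference, the mincore()-style "incore" query discussed throughout the
thread can already be approximated from user space on any mmap-able file
(a hedged sketch via ctypes on Linux; the point of the patch set is that
btrfs's btree_inode cannot be mmap'ed like this, hence the ioctl):

```python
import ctypes
import ctypes.util
import mmap
import os
import tempfile

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
PAGE = mmap.PAGESIZE

def incore_map(path):
    """Return one bool per page: is that page resident in the page cache?"""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if size == 0:
            return []
        # MAP_PRIVATE + PROT_WRITE gives ctypes a writable buffer to take
        # the mapping's address from, without write access to the file.
        m = mmap.mmap(fd, size, flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            npages = (size + PAGE - 1) // PAGE
            vec = (ctypes.c_ubyte * npages)()
            buf = ctypes.c_char.from_buffer(m)      # borrow mapping address
            addr = ctypes.addressof(buf)
            rc = libc.mincore(ctypes.c_void_p(addr),
                              ctypes.c_size_t(size), vec)
            del buf                                  # release buffer export
            if rc != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
            return [bool(b & 1) for b in vec]
        finally:
            m.close()
    finally:
        os.close(fd)

# Demo: query a freshly written scratch file.
fd, demo = tempfile.mkstemp()
os.write(fd, b"btrfs" * 1000)
os.close(fd)
print(incore_map(demo))
os.remove(demo)
```

Note this only answers the residency question, not the referenced bit that
fincore() could additionally expose, which is the extra information debated
above.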
