Re: Documenting VFS, the next installment (was: Tailmerging for Ext2)
[EMAIL PROTECTED] writes: Previously I was talking about three functions: 1) generic_file_read 2) generic_file_write 3) generic_file_mmap Here are some comments of mine, just because I'd like to check out my own understanding of this sort of thing, and trying to explain it in writing is a good way of doing so.

He had this to say: "Each of these functions works its magic through a 'mapping' defined by an address_space struct. In OO terminology this would be a memory mapping class with instances denoting the mapping between a section of memory and a particular kind of underlying secondary storage. So what we have looks a lot like a generalization of a swap file - all (or the vast majority of) file I/O in Linux is now done by memory-mapping..."

"The three functions in question make use of a 'page cache', which appears to be a set of memory page 'heads' (page head - my terminology, since I haven't seen any other so far - an instance of struct page)"

The "struct page" represents (and is the unit of currency of) the physical page. You mention the idea of a MemoryPage lower down in the context of allocating memory, but the struct page is even more basic. Every single physical page of "real" memory accessible to the kernel has/is a "struct page". Since there's no architecture-portable way of tacking extra bits of data onto a physical page, Linux does it by having a separate array (mem_map[]) indexed by page frame number: mem_map[1000] is a structure containing various bits and pieces of information associated with page frame number 1000. In passing, an architecture like S390 *does* in fact store some extra fields for each physical page: a key and some ref/access bits. So thinking of a struct page as a "real physical page" along with some private attributes (in an OO sense) is a good idea.

A struct page represents a physical page of memory and has nothing to do with addresses: one may not even be able to directly write to that physical page from the kernel.
For example, with PAE36 on x86, "high" memory pages (roughly speaking those beyond about page number 256000, or 1GB into physical memory) aren't mapped directly in the kernel's address space. They are reached by mapping them temporarily with calls such as, roughly,

  addr = kmap(page);
  ... can now use address addr to write/read to/from the page ...
  kunmap(page);

or else they are mapped into a process address space with remap_page_range. The "addr = kmap(); ...; kunmap()" way of accessing memory reminds me of the PalmOS way of allocating and reading/writing memory, where you are only granted temporary permission to write to memory, possibly via a different virtual address each time. It's a nice way to separate physical and virtual memory, both as an API and while thinking about it. Obviously, Linux uses various caches and cleverness behind the scenes to avoid too much of a map/unmap/map/unmap performance hit.

"... indexed by offset expressed in units of PAGE_SIZE (4096 on most architectures), munged together with the address of the mapping struct. So we use the page offset within a file plus a unique characteristic of the file (its mapping struct), followed by a linear search to find a piece of memory that has already been set up to map the file/offset we're interested in."

The page cache is simply a fast hash lookup from (address_space, offset) to the struct page holding that data. It used to be (inode, offset) but, from an OO point of view, there's no reason why the "lookup key" needs to be an inode, and so inventing a new type of "address_space" object (of which an inode is one example) works nicely.

Understanding the mapping between pages and buffers is crucial to understanding those three functions. In (1) and (2) operations are performed directly on the pages and buffers, and these operations are quite easy to understand in spite of the fact that the function call chain wanders around and goes very deep at times. At least it's all synchronous.
Being synchronous may make the coding simpler (and hence possibly easier to understand looking through it), but I find the underlying abstractions easier to understand without worrying about whether it's asynchronous/synchronous. Basically, you can (a) initiate I/O to write/read into a physical page, (b) ask whether a particular page is uptodate/dirty, (c) do a blocking wait for a particular page to become up to date and, sort of, (d) ask to get your own function called back when an I/O you initiated is complete. Thinking about it that way lets you separate the vm, filesystem, and I/O request subsystems without them interfering with each other. Again, nice from the OO point of view.

  struct vm_operations_struct {
          void (*open)
          void (*close)
          void (*unmap)
          void (*protect)
          int (*sync)
          struct page * (*nopage)
          struct page * (*wppage)
          int (*swapout)
  };

At this point
Re: Tailmerging for Ext2
Alexander Viro wrote: On Wed, 26 Jul 2000, Daniel Phillips wrote: Stephen asked me some sharp questions about how this would work, and after I answered them to his satisfaction he asked me if I would have time to implement this feature. I said yes, and went on to write an initial design document describing the required modifications to Ext2's handling of inodes, and a prototype algorithm for doing the tail merging. Here is one more for you: Suppose we grow the last fragment/tail/whatever. Do you copy the data out of that shared block? If so, how do you update buffer_heads in pages that cover the relocated data? (Same goes for reiserfs, if they are doing something similar). BTW, our implementation of UFS is fucked up in that respect, so a variant from there will not work.

Hmmm, this is the question I faced in 1993. Balanced trees were the answer.

Hans
Re: Tailmerging for Ext2
Post your document on the reiserfs mailing list when you finish it, the ReiserFS team will enjoy reading it.

Hans

Daniel Phillips wrote: Alexander Viro wrote: On Wed, 26 Jul 2000, Stephen C. Tweedie wrote: On Wed, Jul 26, 2000 at 03:19:46PM -0400, Alexander Viro wrote: Erm? Consider that: huge lseek() + write past the end of file. Woops - got to unmerge the tail (it's an internal block now) and we've got no knowledge of IO going on the page. Again, IO may be asynchronous - no protection from i_sem for us. After that page becomes a regular one, right? Looks like a change of state to me... Naturally, and that change of state must be made atomically by the filesystem.

Yep. Which is the point - there _are_ dragons. I believe that it's doable, but I really want to repeat: Daniel, watch out for races at the moments when page state changes, it needs a more accurate approach than the usual pagecache-using fs. It can be done, but it will take some reading (and yes, Stephen, I know that _you_ know it ;-)

That's apparent, and I feel that Stephen could probably implement the entire tail merge as described so far in a few days. But that wouldn't be as useful as having me and perhaps some interested observers go all the way through the exercise of figuring out the so-far unwritten rules of the buffercache/pagecache duo. The exact same accurate work is required for Tux2, which makes massive use of copy-on-write. Right now, buffer issues are the main thing standing in the way of making a development code release for Tux2. So there is no question in my mind about whether such issues have to be dealt with: they do.

I dove into the 2.4.0 cache code for the first time last night (using lxr - try it, you'll like it) and I'm almost at the point where I have some relevant questions to ask. I notice that buffer.c has increased in size by almost 50% and is far and away the largest module in the VFS.
Worse, buffer.c is massively cross-coupled to the mm subsystem and the page cache, as we know too well. Buffer.c is right at the core of the issues we're talking about. Bearing that in mind, instead of just jumping in and starting to code I'll try the methodical approach :-) My immediate objective is to try to clarify a few things that aren't immediately obvious from the source, in the following areas:

- States and transitions for the main objects:
  - Buffer heads
  - Buffer data
  - Page heads
  - Page data
  - Other?
- Existing concurrency controls:
  - Semaphores/Spinlocks
  - Big kernel lock
  - Filesystem locks
  - Posix locks?
  - Other?
- Planned additions/deletions of concurrency controls

I will also try to make a list of the main internal functions in the VFS (and some related ones from the mm and drivers modules) and examine function-by-function what the intended usage is, what the issues/caveats are, and maybe even how we can expect them to evolve in the future. I think we need even more than this in terms of documentation in order to work effectively, but this at least will be a good start. It will be more than what we have now. If it gets to the point where we can actually answer questions about race conditions by consulting the docs then we really will have accomplished something. Yes, I know that the code is going to keep evolving and sometimes will break the docs, but I also have confidence that the docs can keep up with such evolution given some interested volunteer doc maintainers willing to hang out on the devel list and keep asking questions.

Even in 2.2.x I felt that there was a lot of understated elegance in Linux's buffer cache design. In 2.4.0 it seems to be getting more elegant, although it's hard to say exactly, because of the sparse (read: nonexistent) documentation. This is a problem that can be easily fixed. To get through this I will have to ask a lot of naive-sounding questions. Hopefully I'll have the first batch ready this afternoon (morning, your time).
-- Daniel
Re: Tailmerging for Ext2
Alexander Viro wrote: On Wed, 26 Jul 2000, Stephen C. Tweedie wrote: For tail writes, I'd imagine we would just end up using the page cache as a virtual cache as NFS uses it, and doing plain copy into the buffer cache pages. Ouch. I _really_ don't like it - we end up with special behaviour on one page in the pagecache. And getting data migration from buffer cache to page cache, which is Not Nice(tm). Yuck... Besides, when do we decide that tail is going to be, erm, merged? What will happen with the page then?

Notwithstanding the fact that I intend to try to figure this out for myself over the weekend by reading the code, could I please have a simple explanation of the relationship between the page cache and buffer cache? Specific questions:

- What shall we call the struct that describes a page - a page header?
- Is this a descriptor for a virtual page or a physical page? (I think: physical page.)
- What exactly is the function of the list of buffer pages hanging off a page descriptor? Do these hold the filesystem blocks that map the page in virtual memory?

I'm sure all of this was discussed somewhere on the kernel list sometime in the past, and I missed it.

-- Daniel
Re: Tailmerging for Ext2
Alexander Viro wrote: On Wed, 26 Jul 2000, Stephen C. Tweedie wrote: On Wed, Jul 26, 2000 at 03:19:46PM -0400, Alexander Viro wrote: Erm? Consider that: huge lseek() + write past the end of file. Woops - got to unmerge the tail (it's an internal block now) and we've got no knowledge of IO going on the page. Again, IO may be asynchronous - no protection from i_sem for us. After that page becomes a regular one, right? Looks like a change of state to me... Naturally, and that change of state must be made atomically by the filesystem.

Yep. Which is the point - there _are_ dragons. I believe that it's doable, but I really want to repeat: Daniel, watch out for races at the moments when page state changes, it needs a more accurate approach than the usual pagecache-using fs. It can be done, but it will take some reading (and yes, Stephen, I know that _you_ know it ;-)

That's apparent, and I feel that Stephen could probably implement the entire tail merge as described so far in a few days. But that wouldn't be as useful as having me and perhaps some interested observers go all the way through the exercise of figuring out the so-far unwritten rules of the buffercache/pagecache duo. The exact same accurate work is required for Tux2, which makes massive use of copy-on-write. Right now, buffer issues are the main thing standing in the way of making a development code release for Tux2. So there is no question in my mind about whether such issues have to be dealt with: they do.

I dove into the 2.4.0 cache code for the first time last night (using lxr - try it, you'll like it) and I'm almost at the point where I have some relevant questions to ask. I notice that buffer.c has increased in size by almost 50% and is far and away the largest module in the VFS. Worse, buffer.c is massively cross-coupled to the mm subsystem and the page cache, as we know too well. Buffer.c is right at the core of the issues we're talking about.
Bearing that in mind, instead of just jumping in and starting to code I'll try the methodical approach :-) My immediate objective is to try to clarify a few things that aren't immediately obvious from the source, in the following areas:

- States and transitions for the main objects:
  - Buffer heads
  - Buffer data
  - Page heads
  - Page data
  - Other?
- Existing concurrency controls:
  - Semaphores/Spinlocks
  - Big kernel lock
  - Filesystem locks
  - Posix locks?
  - Other?
- Planned additions/deletions of concurrency controls

I will also try to make a list of the main internal functions in the VFS (and some related ones from the mm and drivers modules) and examine function-by-function what the intended usage is, what the issues/caveats are, and maybe even how we can expect them to evolve in the future. I think we need even more than this in terms of documentation in order to work effectively, but this at least will be a good start. It will be more than what we have now. If it gets to the point where we can actually answer questions about race conditions by consulting the docs then we really will have accomplished something. Yes, I know that the code is going to keep evolving and sometimes will break the docs, but I also have confidence that the docs can keep up with such evolution given some interested volunteer doc maintainers willing to hang out on the devel list and keep asking questions.

Even in 2.2.x I felt that there was a lot of understated elegance in Linux's buffer cache design. In 2.4.0 it seems to be getting more elegant, although it's hard to say exactly, because of the sparse (read: nonexistent) documentation. This is a problem that can be easily fixed. To get through this I will have to ask a lot of naive-sounding questions. Hopefully I'll have the first batch ready this afternoon (morning, your time).

-- Daniel
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Daniel Phillips wrote: Stephen asked me some sharp questions about how this would work, and after I answered them to his satisfaction he asked me if I would have time to implement this feature. I said yes, and went on to write an initial design document describing the required modifications to Ext2's handling of inodes, and a prototype algorithm for doing the tail merging.

Here is one more for you: Suppose we grow the last fragment/tail/whatever. Do you copy the data out of that shared block? If so, how do you update buffer_heads in pages that cover the relocated data? (Same goes for reiserfs, if they are doing something similar). BTW, our implementation of UFS is fucked up in that respect, so a variant from there will not work.
Re: Tailmerging for Ext2
Hi, On Wed, Jul 26, 2000 at 02:05:11PM -0400, Alexander Viro wrote: Here is one more for you: Suppose we grow the last fragment/tail/whatever. Do you copy the data out of that shared block? If so, how do you update buffer_heads in pages that cover the relocated data? (Same goes for reiserfs, if they are doing something similar). BTW, our implementation of UFS is fucked up in that respect, so variant from there will not work. For tail writes, I'd imagine we would just end up using the page cache as a virtual cache as NFS uses it, and doing plain copy into the buffer cache pages. Cheers, Stephen
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Stephen C. Tweedie wrote: Hi, On Wed, Jul 26, 2000 at 02:05:11PM -0400, Alexander Viro wrote: Here is one more for you: Suppose we grow the last fragment/tail/whatever. Do you copy the data out of that shared block? If so, how do you update buffer_heads in pages that cover the relocated data? (Same goes for reiserfs, if they are doing something similar). BTW, our implementation of UFS is fucked up in that respect, so variant from there will not work. For tail writes, I'd imagine we would just end up using the page cache as a virtual cache as NFS uses it, and doing plain copy into the buffer cache pages. Ouch. I _really_ don't like it - we end up with special behaviour on one page in the pagecache. And getting data migration from buffer cache to page cache, which is Not Nice(tm). Yuck... Besides, when do we decide that tail is going to be, erm, merged? What will happen with the page then?
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Alexander Viro wrote: On Wed, 26 Jul 2000, Daniel Phillips wrote: Stephen asked me some sharp questions about how this would work, and after I answered them to his satisfaction he asked me if I would have time to implement this feature. I said yes, and went on to write an initial design document describing the required modifications to Ext2's handling of inodes, and a prototype algorithm for doing the tail merging. Here is one more for you: Suppose we grow the last fragment/tail/whatever. Do you copy the data out of that shared block? If so, how do you update buffer_heads in pages that cover the relocated data? (Same goes for reiserfs, if they are doing something similar). BTW, our implementation of UFS is fucked up in that respect, so a variant from there will not work.

Please bear in mind that I don't pretend to be an expert on the VFS, and especially its latest incarnation in 2.4.0. I'm coming to grips with it now. Notwithstanding that, I'll try to provide some insight anyway.

Suppose we grow the last fragment/tail/whatever. Do you copy the data out of that shared block?

Yes, except possibly in the case where the fragment grows by an amount that will still fit in the shared block. Even in that case, you might want to ignore the possible optimization and copy it out mindlessly, on the assumption that another write is coming soon. My plan is to do the incremental merging at file close time.

If so, how do you update buffer_heads in pages that cover the relocated data?

We have to be sure that if blocks are buffered then they are buffered in exactly one place and you always access them through the buffer hash table. So far so good, but the picture gets murkier for me when you talk about the page cache. I'm not clear yet on the details of how the buffer cache interacts with the page cache, and perhaps you can help shed some light on that. Until I am clear on it, I'll hold off commenting.
(Same goes for reiserfs, if they are doing something similar). I don't know exactly what ReiserFS does - I just heard Hans mention the term 'tail merging' and I could see that it was a good idea. BTW, our implementation of UFS is fucked up in that respect, so variant from there will not work. I'm not sure what you mean there... -- Daniel
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Daniel Phillips wrote: If so, how do you update buffer_heads in pages that cover the relocated data? We have to be sure that if blocks are buffered then they are buffered in exactly one place and you always access them through the buffer hash table. So far so good, but the picture gets murkier for me when you talk

Not. Data normally is in the page. Buffer_heads are not included into the buffer cache. They are referred from the struct page and their ->b_data just points to appropriate pieces of the page. You can not get them via bread(). At all. The buffer cache is only for metadata.

BTW, our implementation of UFS is fucked up in that respect, so a variant from there will not work.

I'm not sure what you mean there...

I mean that UFS has the same problem (relocation of the last fragment) and our implementation is fucked up (== does not deal with that properly and eats data). So if you will look for existing solutions - forget about the UFS one; it isn't one. UFS will need fixing, but that's a separate story...
Re: Tailmerging for Ext2
Hi, On Wed, Jul 26, 2000 at 02:56:01PM -0400, Alexander Viro wrote: Not. Data normally is in page. Buffer_heads are not included into buffer cache. They are refered from the struct page and their -b_data just points to appropriate pieces of page. You can not get them via bread(). At all. Buffer cache is only for metadata. Only in the default usage. There's no reason at all why we can't use separate buffer and page cache aliases of the same data for tails as a special case. Cheers, Stephen
Re: Tailmerging for Ext2
Hi, On Wed, Jul 26, 2000 at 02:41:44PM -0400, Alexander Viro wrote: For tail writes, I'd imagine we would just end up using the page cache as a virtual cache as NFS uses it, and doing plain copy into the buffer cache pages. Ouch. I _really_ don't like it - we end up with special behaviour on one page in the pagecache. Correct. But it's all inside the filesystem, so there is zero VFS impact. And we're talking about non-block-aligned data for tails, so we simply don't have a choice in this case. And getting data migration from buffer cache to page cache, which is Not Nice(tm). Not preferred for bulk data, perhaps, but the VFS should cope just fine. Yuck... Besides, when do we decide that tail is going to be, erm, merged? What will happen with the page then? To the page? Nothing. To the buffer? It gets updated with the new contents of disk. Page == virtual contents. Buffer == physical contents. Plain and simple. Cheers, Stephen
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Stephen C. Tweedie wrote: Hi, On Wed, Jul 26, 2000 at 02:56:01PM -0400, Alexander Viro wrote: Not. Data normally is in page. Buffer_heads are not included into buffer cache. They are refered from the struct page and their -b_data just points to appropriate pieces of page. You can not get them via bread(). At all. Buffer cache is only for metadata. Only in the default usage. There's no reason at all why we can't use separate buffer and page cache aliases of the same data for tails as a special case. In theory - yes, but doing that will require a _lot_ of accurate thinking about possible races. IOW, I'm afraid that transitions tail-normal block will be race-prone. Paint me over-cautious, but after you-know-what... Oh, well... I'm not saying that it's impossible, but I _really_ recommend to take a hard look at race scenarios - there is a potential for plenty of them.
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Stephen C. Tweedie wrote: Hi, On Wed, Jul 26, 2000 at 02:41:44PM -0400, Alexander Viro wrote: For tail writes, I'd imagine we would just end up using the page cache as a virtual cache as NFS uses it, and doing plain copy into the buffer cache pages. Ouch. I _really_ don't like it - we end up with special behaviour on one page in the pagecache. Correct. But it's all inside the filesystem, so there is zero VFS impact. And we're talking about non-block-aligned data for tails, so we simply don't have a choice in this case.

<shrug> Sure, it's not a VFS problem (albeit it _will_ require accurate playing with unmap_() in buffer.c), but ext2 problems are pretty interesting too...

And getting data migration from buffer cache to page cache, which is Not Nice(tm). Not preferred for bulk data, perhaps, but the VFS should cope just fine. Yuck... Besides, when do we decide that tail is going to be, erm, merged? What will happen with the page then? To the page? Nothing. To the buffer? It gets updated with the new contents of disk. Page == virtual contents. Buffer == physical contents. Plain and simple.

Erm? Consider that: huge lseek() + write past the end of file. Woops - got to unmerge the tail (it's an internal block now) and we've got no knowledge of IO going on the page. Again, IO may be asynchronous - no protection from i_sem for us. After that page becomes a regular one, right? Looks like a change of state to me...
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Daniel Phillips wrote: I don't know exactly what ReiserFS does - I just heard Hans mention the term 'tail merging' and I could see that it was a good idea.

I'll give the quick and dirty answer; if people want more details, let me know.

In 2.2, reiserfs_file_write deals directly with tails. It appends to them if there is room in the packed block, or converts them if there isn't. If reiserfs_file_write is called with a buffer size of 512 bytes or more, it tries to write into full blocks instead of tails. This limits the overhead when you cp/untar to create new files. In both releases, there is locking on the tail to prevent races, and we don't bother with tails on files over 16k (configurable).

For 2.4, the functions work like this:

reiserfs_get_block converts the tail into its own block and points the buffer head at the new block.

reiserfs_readpage reads directly from the tail into the page, leaves the buffer head mapped, and sets b_blocknr to 0.

reiserfs_writepage and reiserfs_prepare_write both check for mapped buffer heads with a block number of 0 in the page. If found, they are unmapped. Then block_write_full_page or block_prepare_write is called.

reiserfs_truncate deals directly with the tail. If the last block is packed back into the tail, it is unmapped from the page cache.

reiserfs_file_release will check to see if the tail needs to be repacked, and uses truncate (without changing i_size) to pack the tail.

-chris
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Chris Mason wrote: In both releases, there is locking on the tail to prevent races, and we don't bother with tails on files over 16k (configurable).

What granularity do you have? (for tail size, that is).
Re: Tailmerging for Ext2
On Wed, 26 Jul 2000, Alexander Viro wrote: On Wed, 26 Jul 2000, Chris Mason wrote: In both releases, there is locking on the tail to prevent races, and we don't bother with tails on files over 16k (configurable). What granularity do you have? (for tail size, that is).

From 1 byte to almost the blocksize (4k). But there is a macro for deciding when to use a tail, which varies it based on the file size. If the file is over 12k, it won't have a tail bigger than 1k; an 8k file won't have a tail bigger than 2k. Of course, this is just a guess about the right balance between space and performance...

-chris