Re: patches for test / review
In message <[EMAIL PROTECTED]>, Greg Lehey writes:
>> Hmm, try to keep vinum/RAID5 in the picture when you look at this
>> code, it complicates matters a lot.
>
> I don't think it's that relevant, in fact.

Yes it is, because the CPU needs to read the buffers to calculate the
parity; it cannot just DMA the data out to the hardware the way a SCSI
controller can.

--
Poul-Henning Kamp             FreeBSD coreteam member
[EMAIL PROTECTED]             "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message
Re: patches for test / review
In the last episode (Mar 23), Greg Lehey said:
> Agreed. This is on the Vinum wishlist, but it comes at the expense of
> reliability (how long do you wait to cluster? What happens if the
> system fails in between?). In addition, for Vinum it needs to be done
> before entering the hardware driver.

For the simplest case, you can choose to optimize only when the user
sends a single huge write(). That way you don't have to worry about
caching dirty pages in vinum. This is basically what the hardware
RAIDs I have do: they'll only do the write optimization (they call it
"pipelining") if you actually send a single SCSI write request large
enough to span all the disks. I don't know what would be required to
get our kernel even to be able to write blocks this big (what's the
upper limit on MAXPHYS?).

--
Dan Nelson
[EMAIL PROTECTED]
Re: patches for test / review
On Tuesday, 21 March 2000 at 9:29:56 -0800, Matthew Dillon wrote:
>>> I would think that track-caches and intelligent drives would gain
>>> much if not more of what clustering was designed to gain.
>>
>> Hm. But I'd think that even with modern drives a smaller number of
>> bigger I/Os is preferable over lots of very small I/Os. Or have I
>> missed the point?
>
> As long as you do not blow away the drive's cache with your big I/Os,
> and as long as you actually use all the returned data, it's
> definitely more efficient to issue larger I/Os.
>
> If you generate requests that are too large - say over 1/4 the size
> of the drive's cache - the drive will not be able to optimize
> parallel requests as well.

I think that in the majority of cases there's no need to transfer more
than requested. It could only apply to reads anyway, and the drive
cache probably already has this data. In RAID adapters, over-large
transfers seem almost always to be due to poor firmware design.

For regular files, it might be an idea to set a flag to indicate
whether read-ahead has any hope of being useful (for example, on an
ftp server the answer would be "yes"; for index-sequential files or
such the answer would normally be "no").

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers
Re: patches for test / review
On Monday, 20 March 2000 at 22:52:59 +0100, Poul-Henning Kamp wrote:
> In message <[EMAIL PROTECTED]>, Alfred Perlstein writes:
>> * Poul-Henning Kamp <[EMAIL PROTECTED]> [000320 11:45] wrote:
>>> In message <[EMAIL PROTECTED]>, Alfred Perlstein writes:
>>>
>>>> Keeping the current cluster code is a bad idea. If the drivers
>>>> were taught how to traverse the linked list in the buf struct
>>>> rather than just notice "a big buffer", we could avoid a lot of
>>>> page twiddling and also allow for massive I/O clustering (> 64k).
>>>
>>> Before we redesign the clustering, I would like to know if we
>>> actually have any recent benchmarks which prove that clustering
>>> is overall beneficial?
>>
>> Yes, it is really beneficial.
>>
>> I'm not talking about a redesign of the clustering code as much as
>> making the drivers that take a callback from it actually traverse
>> the 'union cluster_info' rather than relying on the system to fake
>> the pages being contiguous via remapping.
>>
>> There's nothing wrong with the clustering algorithms, it's just the
>> steps it has to take to work with the drivers.
>
> Hmm, try to keep vinum/RAID5 in the picture when you look at this
> code, it complicates matters a lot.

I don't think it's that relevant, in fact.

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers
Re: patches for test / review
On Monday, 20 March 2000 at 15:23:31 -0600, Dan Nelson wrote:
> In the last episode (Mar 20), Poul-Henning Kamp said:
>> In message <[EMAIL PROTECTED]>, Alfred Perlstein writes:
>>> * Poul-Henning Kamp <[EMAIL PROTECTED]> [000320 11:45] wrote:
>>>> Before we redesign the clustering, I would like to know if we
>>>> actually have any recent benchmarks which prove that clustering
>>>> is overall beneficial?
>>>
>>> Yes, it is really beneficial.
>>
>> I would like to see some numbers if you have them.
>
> For hardware RAID arrays that support it, if you can get the system
> to issue writes that are larger than the entire RAID-5 stripe size,
> your immensely slow "read parity/recalc parity/write parity/write
> data" operations turn into "recalc parity for entire stripe/write
> entire stripe". RAID-5 magically achieves RAID-0 write speeds! Given
> 32k granularity and 8 disks per RAID group, you'll need a write size
> of 32k * 7 = 224k. Given 64k granularity and 27 disks, that's 1.6MB.
>
> I have seen the jump in write throughput as I tuned an Oracle
> database's parameters on both Solaris and DEC Unix boxes. Get Oracle
> to write blocks larger than a RAID-5 stripe, and it flies.

Agreed. This is on the Vinum wishlist, but it comes at the expense of
reliability (how long do you wait to cluster? What happens if the
system fails in between?). In addition, for Vinum it needs to be done
before entering the hardware driver.

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers
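The arithmetic behind the full-stripe threshold Dan quotes is simple:
with N disks in a RAID-5 group, one disk's worth of each stripe holds
parity, so a write must cover stripe_unit * (N - 1) bytes of data to
rewrite a whole stripe without a read-modify-write cycle. A quick
sketch (the two geometries are the ones from the message):

```python
def full_stripe_write_size(stripe_unit_kb, disks):
    """Smallest write (in KB) that covers a whole RAID-5 stripe:
    one stripe unit per data disk; one disk's unit goes to parity."""
    data_disks = disks - 1
    return stripe_unit_kb * data_disks

# Geometries from the message:
print(full_stripe_write_size(32, 8))   # 224 (KB)
print(full_stripe_write_size(64, 27))  # 1664 (KB), i.e. about 1.6 MB
```

Anything smaller than this touches only part of a stripe, forcing the
controller back to the slow read-parity/recalc/write-parity path.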
Re: patches for test / review
>>> Eventually all physical I/O needs a physical address. The quickest
>>> way to get to a physical address is to be given an array of
>>> vm_page_t's (which can be trivially translated to physical
>>> addresses).
>>
>> Not all: PIO access to ATA needs virtual access. RAID5 needs
>> virtual access to calculate parity.
>
> I'm not sure what you mean by "virtual access". If you mean
> file-related rather than partition-related, no: like the rest of
> Vinum, RAID-5 uses only partition-related offsets.

No, the issue here has to do with the mapping of the data buffers. If
you're doing PIO, or otherwise manipulating the data in the driver
before you give it to the hardware (e.g. inside vinum), then you need
the data buffers mapped into your virtual address space. OTOH, if
you're handing the buffer information to a busmaster device, you don't
need this; instead you need the physical addresses of the buffer
sections.

>> For RAID5 we have the opposite problem also: data is created which
>> has only a mapped existence and the b_pages[] array is not
>> populated.
>
> Hmm. I really need to check that I'm not missing something here.

The point here is that when you create RAID-5 parity data, the
buffer's physical addresses aren't filled in.

--
\\ Give a man a fish, and you feed him for a day.  \\ Mike Smith
\\ Tell him he should learn how to fish himself,   \\ [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime.              \\ [EMAIL PROTECTED]
Re: patches for test / review
On Monday, 20 March 2000 at 14:04:48 -0800, Matthew Dillon wrote:
> If a particular subsystem needs b_data, then that subsystem is
> obviously willing to take the virtual mapping / unmapping hit. If
> you look at Greg's current code, this is in fact what is occurring:
> the critical path through the buffer cache in a heavily loaded
> system tends to require a KVA mapping *AND* a KVA unmapping on every
> buffer access (just that the unmappings tend to be for unrelated
> buffers). The reason this occurs is that even with the larger amount
> of KVA we made available to the buffer cache in 4.x, there still
> isn't enough to leave mappings intact for long periods of time. A
> 'systat -vm 1' will show you precisely what I mean (also
> sysctl -a | fgrep bufspace).
>
> So we will at least not be any worse off than we are now, and
> probably better off, since many of the buffers in the new system
> will not have to be mapped. For example, when vinum's RAID5 breaks
> up a request and issues a driveio(), it passes a buffer assigned to
> b_data which must be translated (through page table lookups) to
> physical addresses anyway, so the fact that vinum does not populate
> b_pages[] does *NOT* help it in the least. It actually makes the job
> harder.

I think you may be confusing two things, though it doesn't seem to
make much difference. driveio() is used only for accesses to the
configuration information; normal Vinum I/O goes via launch_requests()
(in vinumrequest.c). And it's not just RAID-5 that breaks up a
request: it's any access that spans more than one subdisk (even
concatenated plexes, in exceptional cases).

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers
Re: patches for test / review
On Monday, 20 March 2000 at 20:17:13 +0100, Poul-Henning Kamp wrote:
> In message <[EMAIL PROTECTED]>, Matthew Dillon writes:
>
>> Well, let me tell you what the fuzzy goal is first and then maybe
>> we can work backwards.
>>
>> Eventually all physical I/O needs a physical address. The quickest
>> way to get to a physical address is to be given an array of
>> vm_page_t's (which can be trivially translated to physical
>> addresses).
>
> Not all: PIO access to ATA needs virtual access. RAID5 needs
> virtual access to calculate parity.

I'm not sure what you mean by "virtual access". If you mean
file-related rather than partition-related, no: like the rest of
Vinum, RAID-5 uses only partition-related offsets.

>> What we want to do is to try to extend VMIO (aka the vm_page_t) all
>> the way through the I/O system - both VFS and DEV I/O - in order to
>> remove all the nasty back and forth translations.
>
> I agree, but some drivers need mapping, and we need to cater for
> those. They could simply call a vm_something(struct buf *) function
> which would map the pages, and things would "just work".
>
> For RAID5 we have the opposite problem also: data is created which
> has only a mapped existence and the b_pages[] array is not
> populated.

Hmm. I really need to check that I'm not missing something here.

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers
Re: patches for test / review
> On Tue, Mar 21, 2000 at 01:14:45PM -0800, Rodney W. Grimes wrote:
>>> Prefetching data that is never used is obviously a waste. 256K
>>> might be a bit big; I was thinking of something like 64-128kB.
>>>
>>> Drive caches tend to be 0.5-1MB (on SCSI disks) for modern drives.
>>
>> You're a bit behind the times with that set of numbers for modern
>> SCSI drives. It is now 1 to 16 MB of cache, with 2 and 4MB being
>> the most common.
>
> Your drives are more modern than mine ;-) What drive has 16 MB?
> Curious here...

Seagate's latest and greatest drives have a 4MB cache standard and an
option for 16MB. These are the 10K RPM Cheetah drives.

--
Rod Grimes - KD7CAX @ CN85sl - (RWG25)
[EMAIL PROTECTED]
Re: patches for test / review
On Tue, Mar 21, 2000 at 01:14:45PM -0800, Rodney W. Grimes wrote:
>> Prefetching data that is never used is obviously a waste. 256K
>> might be a bit big; I was thinking of something like 64-128kB.
>>
>> Drive caches tend to be 0.5-1MB (on SCSI disks) for modern drives.
>
> You're a bit behind the times with that set of numbers for modern
> SCSI drives. It is now 1 to 16 MB of cache, with 2 and 4MB being the
> most common.

Your drives are more modern than mine ;-) What drive has 16 MB?
Curious here...

--
Wilko Bulte             Arnhem, The Netherlands
http://www.tcja.nl      The FreeBSD Project: http://www.freebsd.org
Re: patches for test / review
> On Tue, Mar 21, 2000 at 09:29:56AM -0800, Matthew Dillon wrote:
>>> As long as you do not blow away the drive's cache with your big
>>> I/Os, and as long as you actually use all the returned data, it's
>>> definitely more efficient to issue larger I/Os.
>>
>> Prefetching data that is never used is obviously a waste. 256K
>> might be a bit big; I was thinking of something like 64-128kB.
>>
>> Drive caches tend to be 0.5-1MB (on SCSI disks) for modern drives.

You're a bit behind the times with that set of numbers for modern SCSI
drives. It is now 1 to 16 MB of cache, with 2 and 4MB being the most
common.

--
Rod Grimes - KD7CAX @ CN85sl - (RWG25)
[EMAIL PROTECTED]
Re: patches for test / review
On Tue, Mar 21, 2000 at 09:29:56AM -0800, Matthew Dillon wrote:
>>> I would think that track-caches and intelligent drives would gain
>>> much if not more of what clustering was designed to gain.
>>
>> Hm. But I'd think that even with modern drives a smaller number of
>> bigger I/Os is preferable over lots of very small I/Os. Or have I
>> missed the point?
>
> As long as you do not blow away the drive's cache with your big
> I/Os, and as long as you actually use all the returned data, it's
> definitely more efficient to issue larger I/Os.

Prefetching data that is never used is obviously a waste. 256K might
be a bit big; I was thinking of something like 64-128kB.

Drive caches tend to be 0.5-1MB (on SCSI disks) for modern drives. I
happen to hate write-caching on disk drives, so I did not consider
that as a factor.

> If you generate requests that are too large - say over 1/4 the size
> of the drive's cache - the drive will not be able to optimize
> parallel requests as well.

True.

--
Wilko Bulte             Arnhem, The Netherlands
http://www.tcja.nl      The FreeBSD Project: http://www.freebsd.org
Re: patches for test / review
On Mon, Mar 20, 2000 at 11:54:58PM -0800, Matthew Jacob wrote:
>> Hm. But I'd think that even with modern drives a smaller number of
>> bigger I/Os is preferable over lots of very small I/Os.
>
> Not necessarily. It depends upon overhead costs per I/O. With larger
> I/Os you do pay in interference costs (you can't transfer data for
> request N because the 256 kbytes of request M is still in the pipe).

OK. 256K might be a bit on the high side.

--
Wilko Bulte             Arnhem, The Netherlands
http://www.tcja.nl      The FreeBSD Project: http://www.freebsd.org
Re: patches for test / review
:> I would think that track-caches and intelligent drives would gain
:> much if not more of what clustering was designed to gain.
:
:Hm. But I'd think that even with modern drives a smaller number of
:bigger I/Os is preferable over lots of very small I/Os. Or have I
:missed the point?

As long as you do not blow away the drive's cache with your big I/Os,
and as long as you actually use all the returned data, it's definitely
more efficient to issue larger I/Os.

If you generate requests that are too large - say over 1/4 the size of
the drive's cache - the drive will not be able to optimize parallel
requests as well.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
Re: patches for test / review
:> Hm. But I'd think that even with modern drives a smaller number of
:> bigger I/Os is preferable over lots of very small I/Os.
:
:Not necessarily. It depends upon overhead costs per I/O. With larger
:I/Os you do pay in interference costs (you can't transfer data for
:request N because the 256 kbytes of request M is still in the pipe).

This problem has scaled over the last few years. With 5 MB/sec SCSI
busses it was a problem. With 40, 80, and 160 MB/sec it isn't as big
an issue any more.

    256K @ 40 MBytes/sec = 6.25 ms
    256K @ 80 MBytes/sec = 3.125 ms

When you add in write-decoupling (take softupdates, for example), the
issue becomes even less of a problem. The biggest single item that
does not scale well is command/response overhead.

I think it has been successfully argued (but I forget who made the
point) that 64K is not quite into the sweet spot - that 256K is closer
to the mark. But one has to be careful to only issue large requests
for things that are actually going to be used. If you read 256K but
only use 8K of it, you just wasted a whole lot of cpu and bus
bandwidth.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
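The bus-occupancy figures Matt quotes follow directly from transfer
size divided by bus rate (256K is 0.25 MB, so the binary/decimal unit
mismatch cancels out). A quick check:

```python
def transfer_ms(size_mb, rate_mb_per_s):
    """Time in milliseconds a request of size_mb megabytes occupies a
    bus that moves rate_mb_per_s megabytes per second."""
    return size_mb / rate_mb_per_s * 1000.0

# The cases from the message (256K = 0.25 MB):
print(transfer_ms(0.25, 40))   # about 6.25 ms - noticeable
print(transfer_ms(0.25, 80))   # about 3.1 ms
print(transfer_ms(0.25, 5))    # a 5 MB/sec bus: 50 ms per request
```

The 5 MB/sec case shows why the interference argument mattered on
older SCSI busses: a single 256K transfer monopolized the bus for tens
of milliseconds.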
Re: patches for test / review
> Hm. But I'd think that even with modern drives a smaller number of
> bigger I/Os is preferable over lots of very small I/Os.

Not necessarily. It depends upon overhead costs per I/O. With larger
I/Os you do pay in interference costs (you can't transfer data for
request N because the 256 kbytes of request M is still in the pipe).
Re: patches for test / review
On Mon, Mar 20, 2000 at 08:21:52PM +0100, Poul-Henning Kamp wrote:
> In message <[EMAIL PROTECTED]>, Alfred Perlstein writes:
>
>> Keeping the current cluster code is a bad idea. If the drivers were
>> taught how to traverse the linked list in the buf struct rather
>> than just notice "a big buffer", we could avoid a lot of page
>> twiddling and also allow for massive I/O clustering (> 64k).
>
> Before we redesign the clustering, I would like to know if we
> actually have any recent benchmarks which prove that clustering is
> overall beneficial?
>
> I would think that track-caches and intelligent drives would gain
> much if not more of what clustering was designed to gain.

Hm. But I'd think that even with modern drives a smaller number of
bigger I/Os is preferable over lots of very small I/Os. Or have I
missed the point?

--
Wilko Bulte             Arnhem, The Netherlands
http://www.tcja.nl      The FreeBSD Project: http://www.freebsd.org
Re: I/O clustering, Re: patches for test / review
> I agree that it is obvious for NFS, but I don't see it as being
> obvious at all for (modern) disks, so for that case I would like to
> see numbers.
>
> If running without clustering is just as fast for modern disks, I
> think the clustering needs rethinking.

I think it should be pretty obvious, actually. Command overhead is
large (and not getting much smaller), and clustering primarily serves
to reduce the number of commands and thus the ratio of command time
vs. data time. So unless the clustering implementation is extremely
poor, it's worthwhile.

--
\\ Give a man a fish, and you feed him for a day.  \\ Mike Smith
\\ Tell him he should learn how to fish himself,   \\ [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime.              \\ [EMAIL PROTECTED]
Re: patches for test / review
In message <[EMAIL PROTECTED]>, Alfred Perlstein writes:
>* Poul-Henning Kamp <[EMAIL PROTECTED]> [000320 11:45] wrote:
>> In message <[EMAIL PROTECTED]>, Alfred Perlstein writes:
>>
>>> Keeping the current cluster code is a bad idea. If the drivers
>>> were taught how to traverse the linked list in the buf struct
>>> rather than just notice "a big buffer", we could avoid a lot of
>>> page twiddling and also allow for massive I/O clustering (> 64k).
>>
>> Before we redesign the clustering, I would like to know if we
>> actually have any recent benchmarks which prove that clustering is
>> overall beneficial?
>
> Yes, it is really beneficial.
>
> I'm not talking about a redesign of the clustering code as much as
> making the drivers that take a callback from it actually traverse
> the 'union cluster_info' rather than relying on the system to fake
> the pages being contiguous via remapping.
>
> There's nothing wrong with the clustering algorithms, it's just the
> steps it has to take to work with the drivers.

Hmm, try to keep vinum/RAID5 in the picture when you look at this
code; it complicates matters a lot.

--
Poul-Henning Kamp             FreeBSD coreteam member
[EMAIL PROTECTED]             "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!
Re: I/O clustering, Re: patches for test / review
>> Committing a 64k block would require 8 times the overhead of
>> bundling up the RPC as well as transmission and reply. It may be
>> possible to pipeline these commits, because you don't really need
>> to wait for one to complete before issuing another request, but
>> it's still 8x the amount of traffic.
>
> I agree that it is obvious for NFS, but I don't see it as being
> obvious at all for (modern) disks, so for that case I would like to
> see numbers.
>
> If running without clustering is just as fast for modern disks, I
> think the clustering needs rethinking.

Depends on the type of disk drive and how it is configured. Some
drives perform badly (skip a revolution) with back-to-back writes. In
all cases, without aggregation of blocks, you pay the extra cost of
additional interrupts and I/O rundowns, which can be a significant
factor. Also, unless the blocks were originally written by the
application in a chunk, they will likely be mixed with blocks to
varying locations, in which case, for drives without write caching
enabled, you'll have additional seeks to write the blocks out. Things
like this don't show up when doing simplistic sequential write tests.

-DG

David Greenman
Co-founder/Principal Architect, The FreeBSD Project - http://www.freebsd.org
Creator of high-performance Internet servers - http://www.terasolutions.com
Pave the road of life with opportunities.
Re: I/O clustering, Re: patches for test / review
:> I agree that it is obvious for NFS, but I don't see it as being
:> obvious at all for (modern) disks, so for that case I would like
:> to see numbers.
:>
:> If running without clustering is just as fast for modern disks, I
:> think the clustering needs rethinking.
:
: Depends on the type of disk drive and how it is configured. [...]
:
:-DG

I have an excellent example of this related to NFS. It's still
applicable even though the NFS point has already been conceded. As
part of the performance enhancements package, I extended the
sequential detection heuristic to the NFS server side code and turned
on clustering. On the server, mind you, not the client. Read
performance went up drastically: my 100BaseTX network instantly maxed
out and, more importantly, the server-side cpu use went down
drastically. Here is the relevant email from my archives describing
the performance gains:

:From: dillon
:To: Alfred Perlstein <[EMAIL PROTECTED]>
:Cc: Alan Cox <[EMAIL PROTECTED]>, Julian Elischer <[EMAIL PROTECTED]>
:Date: Sun Dec 12 10:11:06 1999
:
:...
:This proposed patch allows us to maintain a sequential read heuristic
:on the server side. I noticed that the NFS server side reads only 8K
:blocks from the physical media even when the NFS client is reading a
:file sequentially.
:
:With this heuristic in place I can now get 9.5 to 10 MBytes/sec
:reading over NFS on a 100BaseTX network, and the server winds up
:being 80% idle. Under -stable the same test runs 72% idle and 8.4
:MBytes/sec.

This is in spite of the fact that in this sequential test the hard
drives were caching the read data ahead anyway. The reduction in
command/response/interrupt overhead on the server by going from 8K
read I/Os to 64K read I/Os in the sequential case made an obvious
beneficial impact on the cpu. I almost halved the cpu overhead on the
server!

So while on-disk caching makes a lot of sense, it is in no way able to
replace software clustering. Having both working together is a killer
combination.

-Matt
Matthew Dillon <[EMAIL PROTECTED]>
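The gain Matt describes is mostly a matter of issuing fewer commands:
reading sequentially in 64K chunks instead of 8K chunks means
one-eighth the requests, and therefore one-eighth the interrupts and
command/response round trips. A toy illustration of that bookkeeping
(the 8K and 64K sizes are from the message; the 10 MB file size is an
arbitrary example):

```python
def request_count(file_bytes, io_size):
    """Number of I/O requests needed to read file_bytes in chunks of
    io_size bytes (ceiling division)."""
    return -(-file_bytes // io_size)

file_size = 10 * 1024 * 1024          # a 10 MB sequential read
small = request_count(file_size, 8 * 1024)    # unclustered: 8K reads
large = request_count(file_size, 64 * 1024)   # clustered: 64K reads
print(small, large, small // large)   # 1280 160 8
```

Per-request overhead (command setup, interrupt, I/O rundown) is fixed,
so an 8x reduction in request count translates directly into cpu time
saved, which is consistent with the "80% idle vs. 72% idle" figures in
the quoted email.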
Re: patches for test / review
In the last episode (Mar 20), Poul-Henning Kamp said:
> In message <[EMAIL PROTECTED]>, Alfred Perlstein writes:
>> * Poul-Henning Kamp <[EMAIL PROTECTED]> [000320 11:45] wrote:
>>>
>>> Before we redesign the clustering, I would like to know if we
>>> actually have any recent benchmarks which prove that clustering is
>>> overall beneficial?
>>
>> Yes, it is really beneficial.
>
> I would like to see some numbers if you have them.

For hardware RAID arrays that support it, if you can get the system to
issue writes that are larger than the entire RAID-5 stripe size, your
immensely slow "read parity/recalc parity/write parity/write data"
operations turn into "recalc parity for entire stripe/write entire
stripe". RAID-5 magically achieves RAID-0 write speeds! Given 32k
granularity and 8 disks per RAID group, you'll need a write size of
32k * 7 = 224k. Given 64k granularity and 27 disks, that's 1.6MB.

I have seen the jump in write throughput as I tuned an Oracle
database's parameters on both Solaris and DEC Unix boxes. Get Oracle
to write blocks larger than a RAID-5 stripe, and it flies.

--
Dan Nelson
[EMAIL PROTECTED]
Re: I/O clustering, Re: patches for test / review
:* Poul-Henning Kamp <[EMAIL PROTECTED]> [000320 12:03] wrote:
:> In message <[EMAIL PROTECTED]>, Alfred Perlstein writes:
:> [...]
:> I would like to see some numbers if you have them.
:
:No, I don't have numbers.
:
:Committing a 64k block would require 8 times the overhead of bundling
:up the RPC as well as transmission and reply. It may be possible to
:pipeline these commits because you don't really need to wait...

Clustering is extremely beneficial. DG and I, and I think even BDE and
Tor, have done a lot of random tests in that area. I did a huge amount
of clustering-related work while optimizing NFSv3 and fixing up the
random/sequential I/O heuristics for 4.0 (for both NFS and UFS).

The current clustering code does a pretty good job, and I would
hesitate to change it at this time. The only real overhead comes from
the KVA pte mappings for b_data in the pbuf that the clustering (and
other) code uses.

I do not think that redoing the clustering will have a beneficial
result until *after* we optimize the I/O path as per my previous
posting. Once we optimize the I/O path to make it more VM Object
centric, it will make it a whole lot easier to remove *ALL* the
artificial I/O size limitations.

-Matt
Re: I/O clustering, Re: patches for test / review
Just as a perhaps interesting aside on this topic: it'd be quite neat
for controllers that understand scatter/gather to be able to simply
suck N regions of buffer cache which were due for committing directly
into an S/G list... (wishlist item, I guess 8)

--
\\ Give a man a fish, and you feed him for a day.  \\ Mike Smith
\\ Tell him he should learn how to fish himself,   \\ [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime.              \\ [EMAIL PROTECTED]
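The wishlist item above amounts to building one scatter/gather list
from many pending buffer regions, coalescing regions that happen to be
physically adjacent into single S/G entries. A minimal sketch of that
idea in Python, assuming we already know each region's physical
address and length (the addresses and page size are made up for
illustration, and real driver code would of course build this in C):

```python
PAGE = 4096

def build_sg_list(regions):
    """Coalesce (phys_addr, length) regions into a minimal
    scatter/gather list: physically adjacent regions collapse into a
    single S/G entry."""
    sg = []
    for addr, length in sorted(regions):
        if sg and sg[-1][0] + sg[-1][1] == addr:
            # This region starts where the previous entry ends:
            # extend the entry instead of adding a new one.
            sg[-1] = (sg[-1][0], sg[-1][1] + length)
        else:
            sg.append((addr, length))
    return sg

# Three dirty buffers; the first two happen to be physically adjacent.
regions = [(0x10000, 2 * PAGE), (0x12000, PAGE), (0x40000, PAGE)]
print(build_sg_list(regions))  # [(65536, 12288), (262144, 4096)]
```

Fewer S/G entries means fewer descriptor fetches by the controller,
which is the same command-overhead argument made elsewhere in the
thread, just one level down.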
Re: patches for test / review
Alfred Perlstein wrote: > > * Poul-Henning Kamp <[EMAIL PROTECTED]> [000320 11:45] wrote: > > In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: > > > > >Keeping the currect cluster code is a bad idea, if the drivers were > > >taught how to traverse the linked list in the buf struct rather > > >than just notice "a big buffer" we could avoid a lot of page > > >twiddling and also allow for massive IO clustering ( > 64k ) > > > > Before we redesign the clustering, I would like to know if we > > actually have any recent benchmarks which prove that clustering > > is overall beneficial ? > > Yes it is really benificial. Yes, I've seen stats that show the degradation when clustering is switched off. Richard Wendlake (who wrote the OS detection code for the Netcraft web server survey) did a lot of testing in this area because of some pathological behavior he was seeing using Gnu's dbm package. Richard, do you want to post a summary of your tests? > > I'm not talking about a redesign of the clustering code as much as > making the drivers that take a callback from it actually traverse > the 'union cluster_info' rather than relying on the system to fake > the pages being contiguous via remapping. > > There's nothing wrong with the clustering algorithms, it's just the > steps it has to take to work with the drivers. Well, there is something wrong with our clustering algorithm. It always starts a new cluster when the first block of a file is written to. I found this when trying to explain some of the pathological behavior that Richard was seeing. Imagine an algorithm that will write blocks 0,5,2,7,4,1,6,3,0,... The clustering algorithm starts a new cluster if the block is at the beginning of the file, so writing block 0 will always start a new cluster. When block 5 is written out, the clustering code will try and add it to the existing cluster, will fail and so will flush the existing cluster which only has block 0 in it and then start another cluster, with block 5 in it. 
This continues, with the previous cluster being flushed and a new cluster being created with the current block in it. Eventually, we get to the point where 7 blocks have been flushed and the current cluster contains block 3. When it comes time to write out the next block 0, the clustering algorithm doesn't bother trying to add the block to the existing cluster but immediately starts a new one, so the cluster with block 3 in it *never gets flushed*. Paul.
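Paul's pathology can be reproduced with a toy model. All names and the data structure here are illustrative, not the real kernel clustering code; the model only assumes the three rules he describes (block 0 always starts a fresh cluster, a contiguous block joins the current cluster, anything else flushes and restarts):

```c
#include <assert.h>

/*
 * Toy model of the clustering behavior described above.  flushed[]
 * counts how many times each block is actually written out; a cluster
 * dropped by the "block 0 restarts" rule is never counted.
 */
#define NBLK 8

static int flushed[NBLK];  /* times each block has been written out */
static int cluster[NBLK];  /* blocks sitting in the current cluster */
static int nclust;

static void flush_cluster(void)
{
    int i;

    for (i = 0; i < nclust; i++)
        flushed[cluster[i]]++;
    nclust = 0;
}

static void write_block(int b)
{
    if (b == 0)
        nclust = 0;             /* block 0: current cluster is dropped */
    else if (nclust == 0 || cluster[nclust - 1] + 1 != b)
        flush_cluster();        /* non-contiguous: flush and restart */
    cluster[nclust++] = b;
}
```

Feeding it the sequence 0,5,2,7,4,1,6,3,0 leaves `flushed[3]` at zero: the cluster holding block 3 is silently discarded when the second write of block 0 restarts clustering, exactly the behavior Paul observed.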
Re: patches for test / review
:> lock on the bp. With a shared lock you are allowed to issue READ :> I/O but you are not allowed to modify the contents of the buffer. :> With an exclusive lock you are allowed to issue both READ and WRITE :> I/O and you can modify the contents of the buffer. :> :> bread() -> bread_sh() and bread_ex() :> :> Obtain and validate (issue read I/O as appropriate) a bp. bread_sh() :> allows a buffer to be accessed but not modified or rewritten. :> bread_ex() allows a buffer to be modified and written. : :This seems to allow for expressing intent to write to buffers, :which would be an excellent place to cow the pages 'in software' :rather than obsd's way of using cow'd pages to accomplish the same :thing. Yes, absolutely. DG (if I remember right) is rabid about not taking VM faults while sitting in the kernel and I tend to agree with him that it's a cop-out to use VM faults in the kernel to get around those sorts of problems. :I'm not sure if you remeber what I brought up at BAFUG, but I'd :like to see something along the lines of BX_BKGRDWRITE that Kirk :is using for the bitmaps blocks in softupdates to be enabled on a :system wide basis. That way rewriting data that has been sent to :the driver isn't blocked and at the same time we don't need to page :protect during every strategy call. : :I may have misunderstood your intent, but using page protections :on each IO would seem to introduce a lot of performance issues that :the rest of these points are all trying to get rid of. At the low-level device there is no concept of page protections. If you pass an array of vm_page_t's then that is where the data will be taken from or written to. A background-write capability is actually much more easily implemented at the VM Object level than the buffer cache level. If you think about it, all you need to do is add another VM Object layer *below* the one representing the device. Whenever a device write is initiated the pages are moved to the underlying layer. 
If a process (or the kernel) needs to modify the pages while the write is in progress, a copy-on-write occurs through normal mechanisms. On completion of the I/O the pages are moved back to the main VM Object device layer except for those that would conflict with any copy-on-write that occurred (the original device pages in the conflict case simply get thrown away). Problem solved. Plus this deals with low-memory situations properly... we do not introduce any new deadlocks. :> The idea for the buffer cache is to shift its functionality to one that :> is solely used to issue device I/O and to keep track of dirty areas for :> proper sequencing of I/O (e.g. softupdate's use of the buffer cache :> to placemark I/O will not change). The core buffer cache code would :... : :Keeping the currect cluster code is a bad idea, if the drivers were :taught how to traverse the linked list in the buf struct rather :than just notice "a big buffer" we could avoid a lot of page :twiddling and also allow for massive IO clustering ( > 64k ) because :we won't be limited by the size of the b_pages[] array for our :upper bound on the amount of buffers we can issue effectively a :scatter/gather on (since the drivers must VTOPHYS them anyway). This devolves down into how simple (or complex) an interface we are willing to use to talk to the low-level device. The reason I would hesitate to move to a 'linked list of buffers' methodology is that *ALL* of the current VM API's pass a single array of vm_page_t's... not just the current struct buf code, but also the VOP_PUTPAGES and VOP_GETPAGES API. 
I would much prefer to keep this simplicity intact in order to avoid introducing even more bugs into the source than we will when we try to do this stuff, which means changing the clustering code from: * copies vm_page_t's into the cluster pbuf's b_pages[] array * maps the pages into b_data to: * copies vm_page_t's into the cluster pbuf's b_pages[] array In other words, keeping the clustering changes as simple as possible. I think once the new I/O path is operational we can then start thinking about how to optimize it -- for example, by having a default (embedded) static array but also allowing the b_pages array to be dynamically allocated. :To realize my "nfs super commit" stuff all we'd need to do is make :the max cluster size something like 0-1 and instantly get an almost :unbounded IO burst. : :-- :-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]] : -Matt Matthew Dillon <[EMAIL PROTECTED]>
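Matt's background-write scheme can be sketched as a single-page toy model. Everything here is hypothetical (the struct and function names are invented for illustration); it only encodes the three transitions he describes: the page moves to a lower object layer when the write starts, a modification during the I/O makes a fresh copy in the upper layer, and completion either moves the page back up or throws the stale lower page away:

```c
#include <assert.h>

/*
 * Single-page toy of background write via an extra VM Object layer.
 * "upper" is what readers and writers see; "lower" is what the device
 * is writing from while the I/O is in flight.
 */
struct bgobj {
    int upper_data, has_upper;
    int lower_data, has_lower;
};

static void write_start(struct bgobj *o)
{
    o->lower_data = o->upper_data;      /* page moves below for the I/O */
    o->has_lower = 1;
    o->has_upper = 0;
}

static void modify_during_io(struct bgobj *o, int newdata)
{
    o->upper_data = newdata;            /* COW: fresh upper page */
    o->has_upper = 1;
}

static void write_done(struct bgobj *o)
{
    if (!o->has_upper) {
        o->upper_data = o->lower_data;  /* no conflict: move page back up */
        o->has_upper = 1;
    }
    o->has_lower = 0;                   /* conflict case: original discarded */
}
```

Note the property that makes the scheme attractive: a rewrite while the I/O is in flight never blocks and never corrupts the in-flight write; the conflicting original is simply thrown away on completion.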
Re: patches for test / review
* Matthew Dillon <[EMAIL PROTECTED]> [000320 14:18] wrote: > > :>lock on the bp. With a shared lock you are allowed to issue READ > :>I/O but you are not allowed to modify the contents of the buffer. > :>With an exclusive lock you are allowed to issue both READ and WRITE > :>I/O and you can modify the contents of the buffer. > :> > :> bread() -> bread_sh() and bread_ex() > :> > :>Obtain and validate (issue read I/O as appropriate) a bp. bread_sh() > :>allows a buffer to be accessed but not modified or rewritten. > :>bread_ex() allows a buffer to be modified and written. > : > :This seems to allow for expressing intent to write to buffers, > :which would be an excellent place to cow the pages 'in software' > :rather than obsd's way of using cow'd pages to accomplish the same > :thing. > > Yes, absolutely. DG (if I remember right) is rabid about not taking > VM faults while sitting in the kernel and I tend to agree with him that > it's a cop-out to use VM faults in the kernel to get around those > sorts of problems. ok, so we're on the same page then. :) > > :I'm not sure if you remeber what I brought up at BAFUG, but I'd > :like to see something along the lines of BX_BKGRDWRITE that Kirk > :is using for the bitmaps blocks in softupdates to be enabled on a > :system wide basis. That way rewriting data that has been sent to > :the driver isn't blocked and at the same time we don't need to page > :protect during every strategy call. > : > :I may have misunderstood your intent, but using page protections > :on each IO would seem to introduce a lot of performance issues that > :the rest of these points are all trying to get rid of. > > At the low-level device there is no concept of page protections. > If you pass an array of vm_page_t's then that is where the data > will be taken from or written to. > > A background-write capability is actually much more easily implemented > at the VM Object level then the buffer cache level. 
If you think about > it, all you need to do is add another VM Object layer *below* the > one representing the device. Whenever a device write is initiated the > pages are moved to the underlying layer. If a process (or the kernel) > needs to modify the pages while the write is in progress, a copy-on-write > occurs through normal mechanisms. On completion of the I/O the pages > are moved back to the main VM Object device layer except for those that > would conflict with any copy-on-write that occured (the original device > pages in the conflict case simply get thrown away). > > Problem solved. Plus this deals with low-memory situations properly... > we do not introduce any new deadlocks. That does sound a lot better; using the buffer system for anything more than describing an IO is a hack, and I'd like to see an implementation such as this be possible. > > :> The idea for the buffer cache is to shift its functionality to one that > :> is solely used to issue device I/O and to keep track of dirty areas for > :> proper sequencing of I/O (e.g. softupdate's use of the buffer cache > :> to placemark I/O will not change). The core buffer cache code would > :... > : > :Keeping the currect cluster code is a bad idea, if the drivers were > :taught how to traverse the linked list in the buf struct rather > :than just notice "a big buffer" we could avoid a lot of page > :twiddling and also allow for massive IO clustering ( > 64k ) because > :we won't be limited by the size of the b_pages[] array for our > :upper bound on the amount of buffers we can issue effectively a > :scatter/gather on (since the drivers must VTOPHYS them anyway). > > This devolves down into how simple (or complex) an interface we > are willing to use to talk to the low-level device. > > The reason I would hesitate to move to a 'linked list of buffers' > methodology is that *ALL* of the current VM API's pass a single > array of vm_page_t's... 
not just the current struct buf code, but also > the VOP_PUTPAGES and VOP_GETPAGES API. > > I would much prefer to keep this simplicity intact in order to avoid > introducing even more bugs into the source then we will when we try > to do this stuff, which means changing the clustering code from: > > * copies vm_page_t's into the cluster pbuf's b_pages[] array > * maps the pages into b_data > > to: > > > * copies vm_page_t's into the cluster pbuf's b_pages[] array > > In otherwords, keeping the clustering changes as simple as possible. > I think once the new I/O path is operational we can then start thinking > about how to optimize it -- for example, by having a default (embedded) > static array but also allowing the b_pages array to be dynamically > allocated. Why? Why allocate a special buffer pbuf just for all of this, problems can develop wher
Re: patches for test / review
: :In message <[EMAIL PROTECTED]>, Matthew Dillon writes: : :>Well, let me tell you what the fuzzy goal is first and then maybe we :>can work backwards. :> :>Eventually all physical I/O needs a physical address. The quickest :>way to get to a physical address is to be given an array of vm_page_t's :>(which can be trivially translated to physical addresses). : :Not all: :PIO access to ATA needs virtual access. :RAID5 needs virtual access to calculate parity. ... which means that the initial implementation for PIO and RAID5 utilizes the mapped-buffer bioops interface rather than the b_pages[] bioops interface. But here's the point: We need to require that all entries *INTO* the bio system start with at least b_pages[] and then generate b_data only when necessary. If a particular device needs a b_data mapping, it can get one, but I think it would be a huge mistake to allow entry into the device subsystem to utilize *either* a b_data mapping *or* a b_pages[] mapping. Big mistake. There has to be a lowest common denominator that the entire system can count on and it pretty much has to be an array of vm_page_t's. If a particular subsystem needs b_data, then that subsystem is obviously willing to take the virtual mapping / unmapping hit. If you look at Greg's current code this is, in fact, what is occurring: the critical path through the buffer cache in a heavily loaded system tends to require a KVA mapping *AND* a KVA unmapping on every buffer access (just that the unmappings tend to be for unrelated buffers). The reason this occurs is because even with the larger amount of KVA we made available to the buffer cache in 4.x, there still isn't enough to leave mappings intact for long periods of time. A 'systat -vm 1' will show you precisely what I mean (also sysctl -a | fgrep bufspace). So we will at least not be any worse off than we are now, and probably better off since many of the buffers in the new system will not have to be mapped. 
For example, when vinum's RAID5 breaks up a request and issues a driveio() it passes a buffer which is assigned to b_data which must be translated (through page table lookups) to physical addresses anyway, so the fact that vinum does not populate b_pages[] does *NOT* help it in the least. It actually makes the job harder. -Matt Matthew Dillon <[EMAIL PROTECTED]> :-- :Poul-Henning Kamp FreeBSD coreteam member :[EMAIL PROTECTED] "Real hackers run -current on their laptop." :FreeBSD -- It will take a long time before progress goes too far!
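Matt's point that a b_pages[] array translates trivially to physical addresses can be illustrated with a sketch of the scatter/gather step a DMA-capable driver would perform. The types here are toy stand-ins (the kernel's real page type is vm_page_t and the translation macro is VM_PAGE_TO_PHYS); the point is that no KVA mapping is needed anywhere in the loop:

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096UL

/* Stand-in for vm_page_t: all a driver needs from it is the physical
 * address backing the page. */
struct toy_page { unsigned long phys; };

struct sg_seg { unsigned long addr; unsigned long len; };

/*
 * Collapse a b_pages[]-style array into scatter/gather segments,
 * merging physically contiguous pages - roughly the work a driver does
 * directly from the page array instead of mapping it into KVA first.
 */
static int build_sglist(struct toy_page *pages, int npages, struct sg_seg *sg)
{
    int i, nseg = 0;

    for (i = 0; i < npages; i++) {
        unsigned long pa = pages[i].phys;

        if (nseg > 0 && sg[nseg - 1].addr + sg[nseg - 1].len == pa) {
            sg[nseg - 1].len += TOY_PAGE_SIZE;  /* contiguous: extend */
        } else {
            sg[nseg].addr = pa;                 /* start a new segment */
            sg[nseg].len = TOY_PAGE_SIZE;
            nseg++;
        }
    }
    return nseg;
}
```

A driver doing PIO (or RAID5 parity, per Poul-Henning's objection) cannot use this path alone: it needs the pages mapped so the CPU can touch the data, which is exactly why a mapped-buffer fallback has to exist alongside the b_pages[] path.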
Re: I/O clustering, Re: patches for test / review
In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: >> >> Before we redesign the clustering, I would like to know if we >> >> actually have any recent benchmarks which prove that clustering >> >> is overall beneficial ? >> > >> >Yes it is really benificial. >> >> I would like to see some numbers if you have them. > >No I don't have numbers. > >Committing a 64k block would require 8 times the overhead of bundling >up the RPC as well as transmission and reply, it may be possible >to pipeline these commits because you don't really need to wait >for one to complete before issueing another request, but it's still >8x the amount of traffic. I agree that it is obvious for NFS, but I don't see it as being obvious at all for (modern) disks, so for that case I would like to see numbers. If running without clustering is just as fast for modern disks, I think the clustering needs to be rethought. -- Poul-Henning Kamp FreeBSD coreteam member [EMAIL PROTECTED] "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
I/O clustering, Re: patches for test / review
* Poul-Henning Kamp <[EMAIL PROTECTED]> [000320 12:03] wrote: > In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: > >* Poul-Henning Kamp <[EMAIL PROTECTED]> [000320 11:45] wrote: > >> In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: > >> > >> >Keeping the currect cluster code is a bad idea, if the drivers were > >> >taught how to traverse the linked list in the buf struct rather > >> >than just notice "a big buffer" we could avoid a lot of page > >> >twiddling and also allow for massive IO clustering ( > 64k ) > >> > >> Before we redesign the clustering, I would like to know if we > >> actually have any recent benchmarks which prove that clustering > >> is overall beneficial ? > > > >Yes it is really benificial. > > I would like to see some numbers if you have them. No I don't have numbers. Committing a 64k block would require 8 times the overhead of bundling up the RPC as well as transmission and reply; it may be possible to pipeline these commits because you don't really need to wait for one to complete before issuing another request, but it's still 8x the amount of traffic. You also complicate and penalize drivers, because not all drivers can add an IO request to an already started transaction; those devices will need to start new transactions for each buffer instead of bundling up the list and passing it all along. Maybe I'm missing something. Is there something to provide a clean way to cluster IO? Can you suggest something that won't have this sort of impact on NFS (and elsewhere) if the clustering code was removed? Bruce, what part of the clustering code makes you think of it as hurting us? I thought it was mapping code? > -- > Poul-Henning Kamp FreeBSD coreteam member > [EMAIL PROTECTED] "Real hackers run -current on their laptop." > FreeBSD -- It will take a long time before progress goes too far! 
-- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
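Alfred's "8x" figure is just block-count arithmetic. A minimal sketch, assuming the 8 KB NFS write size implied by the discussion (the function name is illustrative):

```c
#include <assert.h>

/*
 * Number of commit RPCs needed to push `bytes` of dirty data when the
 * largest write/commit issued is `wsize` bytes.  With 8k blocks and no
 * clustering, a 64k file costs 8 round trips; as a single clustered
 * 64k write it costs 1 - hence the 8x overhead figure.
 */
static unsigned commit_rpcs(unsigned bytes, unsigned wsize)
{
    return (bytes + wsize - 1) / wsize;     /* round up */
}
```

Pipelining the commits, as Alfred notes, hides the latency of the round trips but not the count: the per-RPC bundling, transmission, and reply work is still paid eight times over.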
Re: patches for test / review
In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: >* Poul-Henning Kamp <[EMAIL PROTECTED]> [000320 11:45] wrote: >> In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: >> >> >Keeping the currect cluster code is a bad idea, if the drivers were >> >taught how to traverse the linked list in the buf struct rather >> >than just notice "a big buffer" we could avoid a lot of page >> >twiddling and also allow for massive IO clustering ( > 64k ) >> >> Before we redesign the clustering, I would like to know if we >> actually have any recent benchmarks which prove that clustering >> is overall beneficial ? > >Yes it is really benificial. I would like to see some numbers if you have them. -- Poul-Henning Kamp FreeBSD coreteam member [EMAIL PROTECTED] "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
Re: patches for test / review
* Poul-Henning Kamp <[EMAIL PROTECTED]> [000320 11:45] wrote: > In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: > > >Keeping the currect cluster code is a bad idea, if the drivers were > >taught how to traverse the linked list in the buf struct rather > >than just notice "a big buffer" we could avoid a lot of page > >twiddling and also allow for massive IO clustering ( > 64k ) > > Before we redesign the clustering, I would like to know if we > actually have any recent benchmarks which prove that clustering > is overall beneficial ? Yes it is really beneficial. I'm not talking about a redesign of the clustering code as much as making the drivers that take a callback from it actually traverse the 'union cluster_info' rather than relying on the system to fake the pages being contiguous via remapping. There's nothing wrong with the clustering algorithms, it's just the steps it has to take to work with the drivers. > > I would think that track-caches and intelligent drives would gain > much if not more of what clustering was designed to do gain. > > I seem to remember Bruce saying that clustering could even hurt ? Yes, because of the gyrations it needs to go through to maintain backward compatibility for devices that want to see "one big buffer" rather than simply follow a linked list of io operations. Not true, at least for 'devices' like NFS, where issuing large IO ops saves milliseconds in overhead. Unless each device was to re-buffer IO (which is silly) or scan the vp passed to it (violating the abstraction and being really scary, like my flopped super-commit stuff for NFS), it would make NFS performance even worse for doing commits. Without clustering you'd have to issue a commit RPC for each 8k block. With the current clustering you have to issue a commit for each 64k block. With an unbounded linked list, well, there is only the limit that the filesystem asks for. 
-- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: patches for test / review
In message <[EMAIL PROTECTED]>, Alfred Perlstein writes: >Keeping the currect cluster code is a bad idea, if the drivers were >taught how to traverse the linked list in the buf struct rather >than just notice "a big buffer" we could avoid a lot of page >twiddling and also allow for massive IO clustering ( > 64k ) Before we redesign the clustering, I would like to know if we actually have any recent benchmarks which prove that clustering is overall beneficial? I would think that track-caches and intelligent drives would gain much if not more of what clustering was designed to gain. I seem to remember Bruce saying that clustering could even hurt? -- Poul-Henning Kamp FreeBSD coreteam member [EMAIL PROTECTED] "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
Re: patches for test / review
In message <[EMAIL PROTECTED]>, Matthew Dillon writes: >Well, let me tell you what the fuzzy goal is first and then maybe we >can work backwards. > >Eventually all physical I/O needs a physical address. The quickest >way to get to a physical address is to be given an array of vm_page_t's >(which can be trivially translated to physical addresses). Not all: PIO access to ATA needs virtual access. RAID5 needs virtual access to calculate parity. >What we want to do is to try to extend VMIO (aka the vm_page_t) all >the way through the I/O system - both VFS and DEV I/O, in order to >remove all the nasty back and forth translations. I agree, but some drivers need mapping; we need to cater for those. They could simply call a vm_something(struct buf *) call which would map the pages and things would "just work". For RAID5 we have the opposite problem also: data is created which has only a mapped existence and the b_pages[] array is not populated. >In regards to odd block sizes and offsets the real question is whether >an attempt should be made to translate UIO ops into buffer cache b_pages[] >ops directly, maintaining offsets and odd sizes, or whether we should >back-off to a copy scheme where we allocate b_pages[] for oddly sized >uio's and then copy the data to the uio buffer. I don't know of any non DEV_BSIZE aligned apps that are sufficiently high-profile and high-performance to justify too much code to avoid a copy operation, so I guess that is OK. >My personal preference is to not pollute the VMIO page-passing mechanism >with all sorts of fields to handle weird offsets and sizes. Instead we >ought to take the copy hit for the non-optimal cases, and simply fix all >the programs doing the accesses to pass optimally aligned buffers. For >example, for a raw-I/O on an audio CD track you would pass a page-aligned >buffer with a request size of at least a page (e.g. 
4K on IA32) in your >read(), and the raw device would return '2352' as the result and the >returned data would be page-aligned. No protest from here. Encouraging people to think about their data and the handling of them will always have my vote :-) -- Poul-Henning Kamp FreeBSD coreteam member [EMAIL PROTECTED] "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
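The page-aligned convention Matt proposes would look roughly like this from userland. This is a sketch under stated assumptions: `raw_cd_read` is a stand-in for read(2) on the hypothetical raw CD device, not a real API, and a 4K page is assumed; in the real proposal an unaligned buffer would take a copy hit rather than fail, which the stub simplifies to a refusal:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CDDA_SECTOR 2352        /* bytes in one audio CD frame */
#define TOY_PAGE    4096        /* page size assumed for the sketch */

/*
 * Stand-in for read(2) on the raw device under the proposed rules:
 * a page-aligned buffer of at least a page gets the zero-copy path
 * and a short return of 2352.  The real proposal would fall back to
 * a copy for other buffers; this sketch just refuses them.
 */
static long raw_cd_read(void *buf, unsigned long len)
{
    if (((uintptr_t)buf % TOY_PAGE) != 0 || len < TOY_PAGE)
        return -1;                       /* would be the copy path */
    memset(buf, 0xaa, CDDA_SECTOR);      /* pretend: one frame off the disc */
    return CDDA_SECTOR;
}
```

The point of the short return is that the caller hands in a page's worth of room, the device fills exactly one 2352-byte frame at the start of it, and the kernel never has to carry odd offsets or sizes through the b_pages[] path.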
Re: patches for test / review
* Matthew Dillon <[EMAIL PROTECTED]> [000320 10:01] wrote: > > : > : > :>Kirk and I have already mapped out a plan to drastically update > :>the buffer cache API which will encapsulate much of the state within > :>the buffer cache module. > : > :Sounds good. Combined with my stackable BIO plans that sounds like > :a really great win for FreeBSD. > : > :-- > :Poul-Henning Kamp FreeBSD coreteam member > :[EMAIL PROTECTED] "Real hackers run -current on their laptop." > > I think so. I can give -current a quick synopsis of the plan but I've > probably forgotten some of the bits (note: the points below are not > in any particular order): > * Cleanup the buffer cache API (bread(), BUF_STRATEGY(), and so forth). > Specifically, split out the call functionality such that the buffer > cache can determine whether a buffer being obtained is going to be > used for reading or writing. At the moment we don't know if the system > is going to dirty a buffer until after the fact and this has caused a > lot of pain in regards to dealing with low-memory situations. > > getblk() -> getblk_sh() and getblk_ex() > > Obtain bp without issuing I/O, getting either a shared or exclusive > lock on the bp. With a shared lock you are allowed to issue READ > I/O but you are not allowed to modify the contents of the buffer. > With an exclusive lock you are allowed to issue both READ and WRITE > I/O and you can modify the contents of the buffer. > > bread() -> bread_sh() and bread_ex() > > Obtain and validate (issue read I/O as appropriate) a bp. bread_sh() > allows a buffer to be accessed but not modified or rewritten. > bread_ex() allows a buffer to be modified and written. This seems to allow for expressing intent to write to buffers, which would be an excellent place to cow the pages 'in software' rather than obsd's way of using cow'd pages to accomplish the same thing. 
I'm not sure if you remember what I brought up at BAFUG, but I'd like to see something along the lines of BX_BKGRDWRITE that Kirk is using for the bitmap blocks in softupdates to be enabled on a system wide basis. That way rewriting data that has been sent to the driver isn't blocked and at the same time we don't need to page protect during every strategy call. I may have misunderstood your intent, but using page protections on each IO would seem to introduce a lot of performance issues that the rest of these points are all trying to get rid of. > The idea for the buffer cache is to shift its functionality to one that > is solely used to issue device I/O and to keep track of dirty areas for > proper sequencing of I/O (e.g. softupdate's use of the buffer cache > to placemark I/O will not change). The core buffer cache code would > no longer map things to KVM with b_data, that functionality would be > shifted to the VM Object vm_pager_*() API. The buffer cache would > continue to use the b_pages[] array mechanism to collect pages for I/O, > for clustering, and so forth. Keeping the current cluster code is a bad idea; if the drivers were taught how to traverse the linked list in the buf struct rather than just notice "a big buffer" we could avoid a lot of page twiddling and also allow for massive IO clustering ( > 64k ), because we won't be limited by the size of the b_pages[] array for our upper bound on the number of buffers we can effectively issue a scatter/gather on (since the drivers must VTOPHYS them anyway). To realize my "nfs super commit" stuff all we'd need to do is make the max cluster size something like 0-1 and instantly get an almost unbounded IO burst. -- -Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: patches for test / review
:Thanks for the sketch. It sounds really good. : :Is it your intention that drivers which cannot work from the b_pages[] :array will call to map them into VM, or will a flag on the driver/dev_t/ :whatever tell the generic code that it should be mapped before calling :the driver ? : :What about unaligned raw transfers, say a raw CD read of 2352 bytes :from userland ? I pressume we will need an offset into the first :page for that ? Well, let me tell you what the fuzzy goal is first and then maybe we can work backwards. Eventually all physical I/O needs a physical address. The quickest way to get to a physical address is to be given an array of vm_page_t's (which can be trivially translated to physical addresses). The buffer cache already has such an array, called b_pages[]. Any I/O that runs through b_data or runs through a uio must eventually be cut up into blocks of contiguous physical addresses. What we want to do is to try to extend VMIO (aka the vm_page_t) all the way through the I/O system - both VFS and DEV I/O, in order to remove all the nasty back and forth translations. In regards to raw devices I originally envisioned having two BUF_*() strategy calls - one that uses a page array, and one that uses b_data. But your idea below - using bio_ops[], is much better. In regards to odd block sizes and offsets the real question is whether an attempt should be made to translate UIO ops into buffer cache b_pages[] ops directly, maintaining offsets and odd sizes, or whether we should back-off to a copy scheme where we allocate b_pages[] for oddly sized uio's and then copy the data to the uio buffer. My personal preference is to not pollute the VMIO page-passing mechanism with all sorts of fields to handle weird offsets and sizes. Instead we ought to take the copy hit for the non-optimal cases, and simply fix all the programs doing the accesses to pass optimally aligned buffers. 
For example, for a raw-I/O on an audio CD track you would pass a page-aligned buffer with a request size of at least a page (e.g. 4K on IA32) in your read(), and the raw device would return '2352' as the result and the returned data would be page-aligned. This would allow the system call to use the b_pages[] strategy entry point even for devices with odd sizes and still get optimal (zero-copy) operation. If the user passes a non-aligned (or not a multiple of a page in size) buffer, the system takes the copy hit in order to keep the lower level I/O interface clean. :One thing I would like to see is for the buffers to know how to :write themselves. There is nothing which mandates that a buffer :be backed by a disk-like device, and there are uses for buffers :which aren't. : :Being able to say bp->bop_write(bp) rather than bwrite(bp) would :allow that flexibility. Kirk already introduced a bio_ops[] but :made it global for now, that should be per buffer and have all the :bufferops in it, (except for the onces which instantiate the buffer). : :If we had this, pseudo filesystems like DEVFS could use UFS for :much of their naming management. This is currently impossible. : :-- :Poul-Henning Kamp FreeBSD coreteam member :[EMAIL PROTECTED] "Real hackers run -current on their laptop." :FreeBSD -- It will take a long time before progress goes too far! I like the idea of dynamicizing bio_ops[] and using that to issue struct buf based I/O. It fits very nicely into the general idea of separating the VFS and DEV I/O interfaces (they are currently hopelessly intertwined). Actually, the more I think about it the more I'm willing to just say to hell with it and start doing all the changes all at once, in parallel, including the two patches you wanted reviewed earlier (though I would request that you not combine disparate patch functionalities into a single patch set). I agree with Julian on the point about IPSEC. Dynamicizing bio_ops[] ought to be trivial. 
-Matt Matthew Dillon <[EMAIL PROTECTED]>
Re: patches for test / review
In message <[EMAIL PROTECTED]>, Matthew Dillon writes: >I think so. I can give -current a quick synopsis of the plan but I've >probably forgotten some of the bits (note: the points below are not >in any particular order): Thanks for the sketch. It sounds really good. Is it your intention that drivers which cannot work from the b_pages[] array will call to map them into VM, or will a flag on the driver/dev_t/ whatever tell the generic code that it should be mapped before calling the driver? What about unaligned raw transfers, say a raw CD read of 2352 bytes from userland? I presume we will need an offset into the first page for that? One thing I would like to see is for the buffers to know how to write themselves. There is nothing which mandates that a buffer be backed by a disk-like device, and there are uses for buffers which aren't. Being able to say bp->bop_write(bp) rather than bwrite(bp) would allow that flexibility. Kirk already introduced a bio_ops[] but made it global for now; that should be per buffer and have all the bufferops in it (except for the ones which instantiate the buffer). If we had this, pseudo filesystems like DEVFS could use UFS for much of their naming management. This is currently impossible. -- Poul-Henning Kamp FreeBSD coreteam member [EMAIL PROTECTED] "Real hackers run -current on their laptop." FreeBSD -- It will take a long time before progress goes too far!
Re: patches for test / review
: : :>Kirk and I have already mapped out a plan to drastically update :>the buffer cache API which will encapsulate much of the state within :>the buffer cache module. : :Sounds good. Combined with my stackable BIO plans that sounds like :a really great win for FreeBSD. : :-- :Poul-Henning Kamp FreeBSD coreteam member :[EMAIL PROTECTED] "Real hackers run -current on their laptop." I think so. I can give -current a quick synopsis of the plan but I've probably forgotten some of the bits (note: the points below are not in any particular order): Probably the most important thing to keep in mind when reading over this list is to note that nearly all the changes being contemplated can be implemented without breaking current interfaces, and the current interfaces can then be shifted over to the new interfaces one subsystem at a time (shift, test, shift, test, shift, test) until none of the original use remains. At that point the support for the original API can be removed. * make VOP locking calls recursive. That is, to obtain exclusive recursive locks by default rather than non-recursive locks. * cleanup all VOP_*() interfaces in regards to the special handling of the case where a locked vnode is passed, a locked vnode is returned, and the returned vnode happens to wind up being the same as the locked vnode (Allow a double-locked vnode on return and get rid of all the stupid code that juggles locks around to get around the non-recursive nature of current exclusive locks). VOP_LOOKUP is the most confused interface that needs cleaning up. With only a small amount of additional work, mainly KASSERT's to catch potential problems, we should be able to turn on exclusive recursion. The VOP_*() interfaces will have to be fixed one at a time with VOP_LOOKUP topping the list. * Make exclusive buffer cache locks recursive. Kirk has completed all the preliminary work on this and we should be able to just turn it on. We just haven't gotten around to it (and the release got in the way). 
  This is necessary to support upcoming softupdates mechanisms (e.g.
  background fsck, snapshot dumps) as well as to better support device
  recursion.

* Clean up the buffer cache API (bread(), BUF_STRATEGY(), and so forth).
  Specifically, split out the call functionality such that the buffer
  cache can determine whether a buffer being obtained is going to be
  used for reading or writing.  At the moment we don't know if the
  system is going to dirty a buffer until after the fact, and this has
  caused a lot of pain in regards to dealing with low-memory situations.

	getblk() -> getblk_sh() and getblk_ex()

	    Obtain a bp without issuing I/O, getting either a shared or
	    an exclusive lock on the bp.  With a shared lock you are
	    allowed to issue READ I/O but you are not allowed to modify
	    the contents of the buffer.  With an exclusive lock you are
	    allowed to issue both READ and WRITE I/O and you can modify
	    the contents of the buffer.

	bread() -> bread_sh() and bread_ex()

	    Obtain and validate (issue read I/O as appropriate) a bp.
	    bread_sh() allows a buffer to be accessed but not modified
	    or rewritten.  bread_ex() allows a buffer to be modified
	    and written.

* Many uses of the buffer cache in the critical path do not actually
  require the buffer data to be mapped into KVM.  For example, a number
  of I/O devices need only the b_pages[] array and do not need a b_data
  mapping.  It would not take a huge amount of work to adjust the
  uiomove*() interfaces appropriately.

The general plan is to try to remove whole portions of the current
buffer cache functionality and shift them into the new vm_pager_*()
API.  That is, to operate on VM objects directly whenever possible.
The idea is to reduce the buffer cache to something that is solely used
to issue device I/O and to keep track of dirty areas for proper
sequencing of I/O (e.g. softupdates' use of the buffer cache to
placemark I/O will not change).
The core buffer cache code would no longer map things into KVM via
b_data; that functionality would be shifted to the VM object
vm_pager_*() API.  The buffer cache would continue to use the b_pages[]
array mechanism to collect pages for I/O, for clustering, and so forth.

It should be noted that the buffer cache's perceived slowness is almost
entirely due to all the KVM manipulation it does for b_data, and that
such manipulation is not necessary for the vast majority of the critical
path: reading and writing file data (which can run through the VM object
API), and issuing I/O (which can avoid b_data KVM mappings entirely).
Meta data, such as
Re: patches for test / review
>Kirk and I have already mapped out a plan to drastically update
>the buffer cache API which will encapsulate much of the state within
>the buffer cache module.

Sounds good.  Combined with my stackable BIO plans that sounds like
a really great win for FreeBSD.

--
Poul-Henning Kamp       FreeBSD coreteam member
[EMAIL PROTECTED]         "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message
Re: patches for test / review
:I have two patches up for test at http://phk.freebsd.dk/misc
:
:I'm looking for reviews and tests, in particular vinum testing
:would be nice since Grog is quasi-offline at the moment.
:
:Poul-Henning
:
:2317 BWRITE-STRATEGY.patch
:
:This patch is machine generated except for the ccd.c and buf.h
:parts.
:
:Rename existing BUF_STRATEGY to DEV_STRATEGY
:substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo);
:substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo);
:
:Please test & review.
:
:2317 b_iocmd.patch
:
:This patch removes B_READ, B_WRITE and B_FREEBUF and replaces
:them with a new field in struct buf: b_iocmd.
:
:B_WRITE was bogusly defined as zero, giving rise to obvious
:coding mistakes, and a lot of code implicitly knew this.
:
:This patch also eliminates the redundant flag B_CALL; the same
:thing can be done just as efficiently by comparing b_iodone to
:NULL.
:
:Should you get a panic or drop into the debugger complaining about
:"b_iocmd", don't continue; it is likely to write where it should
:have read.
:
:Please test & review.

Kirk and I have already mapped out a plan to drastically update the
buffer cache API which will encapsulate much of the state within the
buffer cache module.  I don't think it makes much sense to make these
relatively complex, but ultimately not significantly improving, changes
to the buffer cache code at this time.

Specifically, I don't think renaming the BUF_WRITE/VOP_BWRITE or
BUF_STRATEGY/DEV_STRATEGY stuff is worth doing at all, and while I agree
that the idea of separating out the I/O command (the b_iocmd patch) is a
good one, it will be much more effective to do it *AFTER* Kirk and I
have separated out the functional interfaces, because it will then be
mostly encapsulated in a single source module.  At the current time the
extensive nature of the changes has too high a potential for introducing
new bugs into a system that has undergone significant debugging and
testing and is pretty much known to work properly.
					-Matt
					Matthew Dillon
					<[EMAIL PROTECTED]>

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message