Re: patches for test / review

2000-03-23 Thread Greg Lehey

On Monday, 20 March 2000 at 20:17:13 +0100, Poul-Henning Kamp wrote:
 In message [EMAIL PROTECTED], Matthew Dillon writes:

Well, let me tell you what the fuzzy goal is first and then maybe we
can work backwards.

Eventually all physical I/O needs a physical address.  The quickest
way to get to a physical address is to be given an array of vm_page_t's
(which can be trivially translated to physical addresses).

 Not all: PIO access to ATA needs virtual access.  RAID5 needs
 virtual access to calculate parity.

I'm not sure what you mean by "virtual access".  If you mean
file-related rather than partition-related, no: like the rest of
Vinum, RAID-5 uses only partition-related offsets.

What we want to do is to try to extend VMIO (aka the vm_page_t) all
the way through the I/O system - both VFS and DEV I/O, in order to
remove all the nasty back and forth translations.

 I agree, but some drivers need mapping; we need to cater for those.
 They could simply call a vm_something(struct buf *) call which would
 map the pages and things would "just work".

 For RAID5 we have the opposite problem also: data is created which
 has only a mapped existence and the b_pages[] array is not
 populated.

Hmm.  I really need to check that I'm not missing something here.

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers





Re: patches for test / review

2000-03-23 Thread Greg Lehey

On Monday, 20 March 2000 at 14:04:48 -0800, Matthew Dillon wrote:

 If a particular subsystem needs b_data, then that subsystem is obviously
 willing to take the virtual mapping / unmapping hit.  If you look at
 Greg's current code this is, in fact, what is occurring: the critical
 path through the buffer cache in a heavily loaded system tends to require
 a KVA mapping *AND* a KVA unmapping on every buffer access (just that the
 unmappings tend to be for unrelated buffers).  The reason this occurs
 is because even with the larger amount of KVA we made available to the
 buffer cache in 4.x, there still isn't enough to leave mappings intact
 for long periods of time.  A 'systat -vm 1' will show you precisely
 what I mean (also sysctl -a | fgrep bufspace).

 So we will at least not be any worse off than we are now, and probably
 better off since many of the buffers in the new system will not have
 to be mapped.  For example, when vinum's RAID5 breaks up a request
 and issues a driveio() it passes a buffer which is assigned to b_data
 which must be translated (through page table lookups) to physical
 addresses anyway, so the fact that vinum does not populate
 b_pages[] does *NOT* help it in the least.  It actually makes the job
 harder.
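
To make the two translation paths concrete, here is a rough sketch (mine,
not code from the patches under discussion) of how a driver ends up with
physical addresses in each case.  It assumes the classic struct buf fields
b_data, b_bcount, b_pages[] and b_npages, plus the stock vtophys() and
VM_PAGE_TO_PHYS() helpers.

#include <sys/param.h>
#include <sys/buf.h>
#include <vm/vm.h>
#include <vm/vm_page.h>
#include <vm/pmap.h>

/* Only b_data populated: a page-table lookup per page. */
static void
walk_mapped_buffer(struct buf *bp)
{
    vm_offset_t va = (vm_offset_t)bp->b_data;
    vm_offset_t end = va + bp->b_bcount;

    while (va < end) {
        vm_offset_t pa = vtophys(va);   /* page-table walk */
        /* ... hand pa (and the run length) to the hardware ... */
        va = trunc_page(va) + PAGE_SIZE;
    }
}

/* b_pages[] populated: the translation is a simple array lookup. */
static void
walk_page_array(struct buf *bp)
{
    int i;

    for (i = 0; i < bp->b_npages; i++) {
        vm_offset_t pa = VM_PAGE_TO_PHYS(bp->b_pages[i]);
        /* ... hand pa to the hardware ... */
    }
}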

I think you may be confusing two things, though it doesn't seem to
make much difference.  driveio() is used only for accesses to the
configuration information; normal Vinum I/O goes via launch_requests()
(in vinumrequest.c).  And it's not just RAID-5 that breaks up a
request, it's any access that goes over more than one subdisk (even
concatenated plexes in exceptional cases).

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers





Re: patches for test / review

2000-03-23 Thread Mike Smith

 Eventually all physical I/O needs a physical address.  The quickest
 way to get to a physical address is to be given an array of vm_page_t's
 (which can be trivially translated to physical addresses).
 
  Not all: PIO access to ATA needs virtual access.  RAID5 needs
  virtual access to calculate parity. 
 
 I'm not sure what you mean by "virtual access".  If you mean
 file-related rather than partition-related, no: like the rest of
 Vinum, RAID-5 uses only partition-related offsets.

No, the issue here has to do with the mapping of the data buffers.  If 
you're doing PIO, or otherwise manipulating the data in the driver before 
you give it to the hardware (e.g. inside vinum), then you need the data 
buffers mapped into your virtual address space.

OTOH, if you're handing the buffer information to a busmaster device, 
you don't need this; instead you need the physical addresses of the buffer 
sections.
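
A hedged sketch of the kind of "map it for me" helper mentioned earlier in
the thread (the vm_something(struct buf *) idea): give a driver that must
touch the bytes (PIO, parity calculation) a virtual mapping of an unmapped
b_pages[] buffer.  The function names and the use of a preallocated KVA
window are my assumptions, not part of any posted patch.

#include <sys/param.h>
#include <sys/buf.h>
#include <vm/vm.h>
#include <vm/vm_page.h>
#include <vm/pmap.h>

static void
buf_map_for_pio(struct buf *bp, vm_offset_t kva_window)
{
    /* Enter the buffer's pages into a KVA window reserved for it. */
    pmap_qenter(kva_window, bp->b_pages, bp->b_npages);
    bp->b_data = (caddr_t)kva_window;
}

static void
buf_unmap_after_pio(struct buf *bp, vm_offset_t kva_window)
{
    pmap_qremove(kva_window, bp->b_npages);
    bp->b_data = NULL;
}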

  For RAID5 we have the opposite problem also: data is created which
  has only a mapped existance and the b_pages[] array is not
  populated.
 
 Hmm.  I really need to check that I'm not missing something here.

The point here is that when you create RAID5 parity data, the buffer's 
physical addresses aren't filled in.
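
A minimal illustration (my own, not vinum code) of why parity generation
wants a virtual mapping: the parity block is the byte-for-byte XOR of the
data blocks, so the CPU has to be able to address the data; a bare list of
physical pages is not enough.

#include <stdio.h>
#include <string.h>

/* parity = d0 ^ d1 ^ ... ^ d(ndata-1); every block is 'len' bytes long */
static void
raid5_parity(unsigned char *parity, unsigned char **data, int ndata, size_t len)
{
    memset(parity, 0, len);
    for (int d = 0; d < ndata; d++)
        for (size_t i = 0; i < len; i++)
            parity[i] ^= data[d][i];
}

int
main(void)
{
    unsigned char d0[4] = { 1, 2, 3, 4 }, d1[4] = { 8, 8, 8, 8 };
    unsigned char d2[4] = { 0, 1, 0, 1 }, p[4];
    unsigned char *data[] = { d0, d1, d2 };

    raid5_parity(p, data, 3, sizeof(p));
    printf("parity: %u %u %u %u\n", p[0], p[1], p[2], p[3]);  /* 9 11 11 13 */
    return (0);
}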

-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime. \\  [EMAIL PROTECTED]







Re: patches for test / review

2000-03-23 Thread Greg Lehey

On Monday, 20 March 2000 at 15:23:31 -0600, Dan Nelson wrote:
 In the last episode (Mar 20), Poul-Henning Kamp said:
 In message [EMAIL PROTECTED], Alfred Perlstein writes:
 * Poul-Henning Kamp [EMAIL PROTECTED] [000320 11:45] wrote:

 Before we redesign the clustering, I would like to know if we
 actually have any recent benchmarks which prove that clustering is
 overall beneficial ?

 Yes it is really beneficial.

 I would like to see some numbers if you have them.

 For hardware RAID arrays that support it, if you can get the system to
 issue writes that are larger than the entire RAID-5 stripe size, your
 immensely slow "read parity/recalc parity/write parity/write data"
 operations turn into "recalc parity for entire stripe/write entire
 stripe".  RAID-5 magically achieves RAID-0 write speeds!  Given 32k
 granularity, and 8 disks per RAID group, you'll need a write
 size of 32*7 = 224k.  Given 64K granularity and 27 disks, that's 1.6MB.

 I have seen the jump in write throughput as I tuned an Oracle
 database's parameters on both Solaris and DEC Unix boxes.  Get Oracle
 to write blocks larger than a RAID-5 stripe, and it flies.
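
The arithmetic above, spelled out (my own illustration): a full RAID-5
stripe covers (ndisks - 1) data chunks of stripe_unit bytes each, since one
chunk per stripe goes to parity, so that is the smallest write that lets
the array skip the read-modify-write cycle.

#include <stdio.h>

int
main(void)
{
    const long kb = 1024;
    long stripe_unit, ndisks;

    stripe_unit = 32 * kb; ndisks = 8;
    printf("32k x 8 disks:  %ldk full-stripe write\n",
        stripe_unit * (ndisks - 1) / kb);          /* 224k */

    stripe_unit = 64 * kb; ndisks = 27;
    printf("64k x 27 disks: %ldk (~1.6MB)\n",
        stripe_unit * (ndisks - 1) / kb);          /* 1664k */
    return (0);
}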

Agreed.  This is on the Vinum wishlist, but it comes at the expense of
reliability (how long do you wait to cluster?  What happens if the
system fails in between?).  In addition, for Vinum it needs to be done
before entering the hardware driver.

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers





Re: patches for test / review

2000-03-23 Thread Greg Lehey

On Monday, 20 March 2000 at 22:52:59 +0100, Poul-Henning Kamp wrote:
 In message [EMAIL PROTECTED], Alfred Perlstein writes:
 * Poul-Henning Kamp [EMAIL PROTECTED] [000320 11:45] wrote:
 In message [EMAIL PROTECTED], Alfred Perlstein writes:

 Keeping the current cluster code is a bad idea: if the drivers were
 taught how to traverse the linked list in the buf struct rather
 than just notice "a big buffer" we could avoid a lot of page
 twiddling and also allow for massive IO clustering (> 64k)

 Before we redesign the clustering, I would like to know if we
 actually have any recent benchmarks which prove that clustering
 is overall beneficial ?

 Yes it is really beneficial.

 I'm not talking about a redesign of the clustering code as much as
 making the drivers that take a callback from it actually traverse
 the 'union cluster_info' rather than relying on the system to fake
 the pages being contiguous via remapping.

 There's nothing wrong with the clustering algorithms, it's just the
 steps it has to take to work with the drivers.

 Hmm, try to keep vinum/RAID5 in the picture when you look at this
 code; it complicates matters a lot.

I don't think it's that relevant, in fact.

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers





Re: patches for test / review

2000-03-23 Thread Greg Lehey

On Tuesday, 21 March 2000 at  9:29:56 -0800, Matthew Dillon wrote:

 I would think that track-caches and intelligent drives would gain
 much if not more of what clustering was designed to gain.

 Hm. But I'd think that even with modern drives a smaller number of bigger
 I/Os is preferable over lots of very small I/Os. Or have I missed the point?

 As long as you do not blow away the drive's cache with your big I/O's,
 and as long as you actually use all the returned data, it's definitely
 more efficient to issue larger I/O's.

 If you generate requests that are too large - say over 1/4 the size of
 the drive's cache, the drive will not be able to optimize parallel
 requests as well.

I think that in the majority of cases there's no need to transfer more
than requested.  It could only apply to reads, and the drive cache
probably has this data already.  In RAID adapters, it seems almost
always to be due to poor firmware design.  For regular files, it might
be an idea to set a flag to indicate whether read-ahead has any hope of
being useful (for example, on an ftp server the answer would be "yes";
for index-sequential files or such the answer would normally be "no").

Greg
--
Finger [EMAIL PROTECTED] for PGP public key
See complete headers for address and phone numbers





Re: patches for test / review

2000-03-23 Thread Dan Nelson

In the last episode (Mar 23), Greg Lehey said:
 
 Agreed.  This is on the Vinum wishlist, but it comes at the expense of
 reliability (how long do you wait to cluster?  What happens if the
 system fails in between?).  In addition, for Vinum it needs to be done
 before entering the hardware driver.

For the simplest case, you can choose to optimize only when the user
sends a single huge write().  That way you don't have to worry about
caching dirty pages in vinum.  This is basically what the hardware
RAIDs I have do.  They'll only do the write optimization (they call it
"pipelining") if you actually send a single SCSI write request large
enough to span all the disks.  I don't know what would be required to
get our kernel to even be able to write blocks this big (what's the
upper limit on MAXPHYS)?
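
A rough illustration of the limit Dan is asking about: physio()-style raw
I/O is carved into chunks no larger than MAXPHYS before it reaches a
driver, so a single huge write() does not automatically become a single
huge transfer.  The 128 KiB value below matches the traditional definition
(128 * 1024); treat it as an assumption here rather than a statement about
every configuration.

#include <stdio.h>

#define MAXPHYS_ASSUMED (128 * 1024)

int
main(void)
{
    long resid = 1664 * 1024;   /* the ~1.6MB full stripe from earlier */
    int nxfer = 0;

    while (resid > 0) {
        long chunk = resid < MAXPHYS_ASSUMED ? resid : MAXPHYS_ASSUMED;
        resid -= chunk;
        nxfer++;
    }
    printf("1.6MB write -> %d physical transfers\n", nxfer);   /* 13 */
    return (0);
}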

-- 
Dan Nelson
[EMAIL PROTECTED]





Re: patches for test / review

2000-03-21 Thread Matthew Dillon


: Hm. But I'd think that even with modern drives a smaller number of bigger
: I/Os is preferable over lots of very small I/Os.
:
:Not necessarily. It depends upon overhead costs per-i/o. With larger I/Os, you
:do pay in interference costs (you can't transfer data for request N because
:the 256Kbytes of request M is still in the pipe).

This problem has scaled over the last few years.  With 5 MB/sec SCSI
busses it was a problem.  With 40, 80, and 160 MB/sec it isn't as big
an issue any more.

256K @ 40 MBytes/sec = 6.25 ms.
256K @ 80 MBytes/sec = 3.125 ms.

When you add in write-decoupling (take softupdates, for example), the
issue becomes even less of a problem.

The biggest single item that does not scale well is command/response 
overhead.  I think it has been successfully argued (but I forgot who
made the point) that 64K is not quite into the sweet spot - that 256K
is closer to the mark.  But one has to be careful to only issue large
requests for things that are actually going to be used.  If you read
256K but only use 8K of it, you just wasted a whole lot of cpu and bus
bandwidth.
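
A back-of-the-envelope version of the argument above (my own numbers, not
measurements from this thread): per-request command/response overhead is
roughly fixed, so larger requests amortize it better, up to the point where
the transfer itself dominates.  The 0.5 ms overhead figure is an assumption
for illustration.

#include <stdio.h>

int
main(void)
{
    const double overhead_ms = 0.5;     /* assumed per-command cost */
    const double bus_mb_s = 40.0;       /* 40 MB/sec SCSI */
    const int sizes_kb[] = { 8, 64, 256 };
    int i;

    for (i = 0; i < 3; i++) {
        double xfer_ms = (sizes_kb[i] / 1024.0) / bus_mb_s * 1000.0;
        printf("%3dK: %.2f ms transfer + %.2f ms overhead (%.0f%% overhead)\n",
            sizes_kb[i], xfer_ms, overhead_ms,
            100.0 * overhead_ms / (xfer_ms + overhead_ms));
    }
    return (0);
}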

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: patches for test / review

2000-03-21 Thread Matthew Dillon

: 
: I would think that track-caches and intelligent drives would gain
: much if not more of what clustering was designed to gain.
:
:Hm. But I'd think that even with modern drives a smaller number of bigger
:I/Os is preferable over lots of very small I/Os. Or have I missed the point?
:
:-- 
:Wilko Bulte   Arnhem, The Netherlands   
:http://www.tcja.nl The FreeBSD Project: http://www.freebsd.org

As long as you do not blow away the drive's cache with your big I/O's,
and as long as you actually use all the returned data, it's definitely 
more efficient to issue larger I/O's.

If you generate requests that are too large - say over 1/4 the size of
the drive's cache, the drive will not be able to optimize parallel 
requests as well.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: patches for test / review

2000-03-21 Thread Wilko Bulte

On Mon, Mar 20, 2000 at 11:54:58PM -0800, Matthew Jacob wrote:
  
  Hm. But I'd think that even with modern drives a smaller number of bigger
  I/Os is preferable over lots of very small I/Os.
 
 Not necessarily. It depends upon overhead costs per-i/o. With larger I/Os, you
 do pay in interference costs (you can't transfer data for request N because
 the 256Kbytes of request M is still in the pipe).

OK. 256K might be a bit on the high side. 

-- 
Wilko Bulte Arnhem, The Netherlands   
http://www.tcja.nl  The FreeBSD Project: http://www.freebsd.org





Re: patches for test / review

2000-03-21 Thread Wilko Bulte

On Tue, Mar 21, 2000 at 09:29:56AM -0800, Matthew Dillon wrote:
 : 
 : I would think that track-caches and intelligent drives would gain
 : much if not more of what clustering was designed to gain.
 :
 :Hm. But I'd think that even with modern drives a smaller number of bigger
 :I/Os is preferable over lots of very small I/Os. Or have I missed the point?

 As long as you do not blow away the drive's cache with your big I/O's,
 and as long as you actually use all the returned data, it's definitely 
 more efficient to issue larger I/O's.

Prefetching data that is never used is obviously a waste. 256K might be a
bit big; I was thinking of something like 64-128KB.

Drive caches tend to be 0.5-1Mbyte (on SCSI disks) for modern drives. 

I happen to hate write-caching on disk drives so I did not consider that as
a factor.

 If you generate requests that are too large - say over 1/4 the size of
 the drive's cache, the drive will not be able to optimize parallel 
 requests as well.

True.

-- 
Wilko Bulte Arnhem, The Netherlands   
http://www.tcja.nl  The FreeBSD Project: http://www.freebsd.org





Re: patches for test / review

2000-03-21 Thread Rodney W. Grimes

 On Tue, Mar 21, 2000 at 09:29:56AM -0800, Matthew Dillon wrote:
  : 
  : I would think that track-caches and intelligent drives would gain
  : much if not more of what clustering was designed to gain.
  :
  :Hm. But I'd think that even with modern drives a smaller number of bigger
  :I/Os is preferable over lots of very small I/Os. Or have I missed the point?
 
  As long as you do not blow away the drive's cache with your big I/O's,
  and as long as you actually use all the returned data, it's definitely 
  more efficient to issue larger I/O's.
 
 Prefetching data that is never used is obviously a waste. 256K might be a
 bit big, I was thinking of something like 64-128Kb 
 
 Drive caches tend to be 0.5-1Mbyte (on SCSI disks) for modern drives. 

You're a bit behind the times with that set of numbers for modern SCSI
drives.  It is now 1 to 16 Mbyte of cache, with 2 and 4Mbyte being the
most common.


-- 
Rod Grimes - KD7CAX @ CN85sl - (RWG25)   [EMAIL PROTECTED]





Re: patches for test / review

2000-03-21 Thread Wilko Bulte

On Tue, Mar 21, 2000 at 01:14:45PM -0800, Rodney W. Grimes wrote:
  On Tue, Mar 21, 2000 at 09:29:56AM -0800, Matthew Dillon wrote:
   : 
   : I would think that track-caches and intelligent drives would gain
   : much if not more of what clustering was designed to gain.
   :
   :Hm. But I'd think that even with modern drives a smaller number of bigger
   :I/Os is preferable over lots of very small I/Os. Or have I missed the point?
  
   As long as you do not blow away the drive's cache with your big I/O's,
   and as long as you actually use all the returned data, it's definitely 
   more efficient to issue larger I/O's.
  
  Prefetching data that is never used is obviously a waste. 256K might be a
  bit big, I was thinking of something like 64-128Kb 
  
  Drive caches tend to be 0.5-1Mbyte (on SCSI disks) for modern drives. 
 
 You're a bit behind the times with that set of numbers for modern SCSI
 drives.  It is now 1 to 16 Mbyte of cache, with 2 and 4Mbyte being the
 most common.

Your drives are more modern than mine ;-) What drive has 16 Mb? Curious
here..

-- 
Wilko Bulte Arnhem, The Netherlands   
http://www.tcja.nl  The FreeBSD Project: http://www.freebsd.org





Re: patches for test / review

2000-03-21 Thread Rodney W. Grimes

 On Tue, Mar 21, 2000 at 01:14:45PM -0800, Rodney W. Grimes wrote:
   On Tue, Mar 21, 2000 at 09:29:56AM -0800, Matthew Dillon wrote:
: 
: I would think that track-caches and intelligent drives would gain
: much if not more of what clustering was designed to do gain.
:
:Hm. But I'd think that even with modern drives a smaller number of bigger
:I/Os is preferable over lots of very small I/Os. Or have I missed the point?
   
As long as you do not blow away the drive's cache with your big I/O's,
and as long as you actually use all the returned data, it's definitely 
more efficient to issue larger I/O's.
   
   Prefetching data that is never used is obviously a waste. 256K might be a
   bit big, I was thinking of something like 64-128Kb 
   
   Drive caches tend to be 0.5-1Mbyte (on SCSI disks) for modern drives. 
  
  You're a bit behind the times with that set of numbers for modern SCSI
  drives.  It is now 1 to 16 Mbyte of cache, with 2 and 4Mbyte being the
  most common.
 
 Your drives are more modern than mine ;-) What drive has 16 Mb? Curious
 here..

Seagate's latest and greatest drives have a 4MB cache standard and an option
for 16MB.  These are 10K RPM Cheetah drives.  


-- 
Rod Grimes - KD7CAX @ CN85sl - (RWG25)   [EMAIL PROTECTED]





Re: patches for test / review

2000-03-20 Thread Matthew Dillon


:I have two patches up for test at http://phk.freebsd.dk/misc
:
:I'm looking for reviews and tests, in particular vinum testing
:would be nice since Grog is quasi-offline at the moment.
:
:Poul-Henning
:
:2317 BWRITE-STRATEGY.patch
:
:This patch is machine generated except for the ccd.c and buf.h
:parts.
:
:Rename existing BUF_STRATEGY to DEV_STRATEGY
:substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo);
:substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo);
:
:Please test & review.
:
:
:2317 b_iocmd.patch
:
:This patch removes B_READ, B_WRITE and B_FREEBUF and replaces
:them with a new field in struct buf: b_iocmd.
:
:B_WRITE was bogusly defined as zero giving rise to obvious
:coding mistakes and a lot of code implicitly knew this.
:
:This patch also eliminates the redundant flag B_CALL; it can
:just as efficiently be done by comparing b_iodone to NULL.
:
:Should you get a panic or drop into the debugger, complaining about
:"b_iocmd", don't continue, it is likely to write where it should
:have read.
:
:Please test & review.
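
A hedged sketch of the shape of the two changes described above.  The names
on the left come straight from the patch description; the enum and its
constant names are my illustration, not the actual patch text.

/* 1. The wrappers the first patch substitutes in (the old BUF_STRATEGY,
 *    which went straight to the device, becomes DEV_STRATEGY): */
#define BUF_WRITE(bp)       VOP_BWRITE((bp)->b_vp, (bp))
#define BUF_STRATEGY(bp)    VOP_STRATEGY((bp)->b_vp, (bp))

/* 2. The b_iocmd idea: an explicit command field instead of B_READ /
 *    B_WRITE flag bits, so a forgotten assignment no longer silently
 *    looks like a write (B_WRITE was bogusly defined as 0). */
enum buf_iocmd_sketch {
    BUF_CMD_INVALID = 0,    /* catchable in an assertion */
    BUF_CMD_READ,
    BUF_CMD_WRITE,
    BUF_CMD_FREEBUF
};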

Kirk and I have already mapped out a plan to drastically update
the buffer cache API which will encapsulate much of the state within
the buffer cache module.  I don't think it makes much sense to make
these relatively complex but ultimately not-significantly-improving 
changes to the buffer cache code at this time.  Specifically, I
don't think renaming the BUF_WRITE/VOP_BWRITE or BUF_STRATEGY/DEV_STRATEGY
stuff is worth doing at all, and while I agree that the idea of separating
out the IO command (b_iocmd patch) is a good one, it will be much more
effective to do it *AFTER* Kirk and I have separated out the functional 
interfaces because it will be mostly encapsulated in a single source
module.  At the current time the extensive nature of the changes has
too high a potential for introducing new bugs in a system that has 
undergone significant debugging and testing and is pretty much known to
work properly.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]

:
:--
:Poul-Henning Kamp FreeBSD coreteam member
:[EMAIL PROTECTED]   "Real hackers run -current on their laptop."
:FreeBSD -- It will take a long time before progress goes too far!
:






Re: patches for test / review

2000-03-20 Thread Poul-Henning Kamp


Kirk and I have already mapped out a plan to drastically update
the buffer cache API which will encapsulate much of the state within
the buffer cache module.

Sounds good.  Combined with my stackable BIO plans that sounds like
a really great win for FreeBSD.

--
Poul-Henning Kamp FreeBSD coreteam member
[EMAIL PROTECTED]   "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!





Re: patches for test / review

2000-03-20 Thread Matthew Dillon


:
:
:Kirk and I have already mapped out a plan to drastically update
:the buffer cache API which will encapsulate much of the state within
:the buffer cache module.
:
:Sounds good.  Combined with my stackable BIO plans that sounds like
:a really great win for FreeBSD.
:
:--
:Poul-Henning Kamp FreeBSD coreteam member
:[EMAIL PROTECTED]   "Real hackers run -current on their laptop."

I think so.  I can give -current a quick synopsis of the plan but I've
probably forgotten some of the bits (note: the points below are not
in any particular order):

Probably the most important thing to keep in mind when reading over
this list is to note that nearly all the changes being contemplated 
can be implemented without breaking current interfaces, and the current
interfaces can then be shifted over to the new interfaces one subsystem
at a time (shift, test, shift, test, shift, test) until none of the
original use remains.  At that point the support for the original API
can be removed.

* make VOP locking calls recursive.  That is, to obtain exclusive
  recursive locks by default rather than non-recursive locks.

* cleanup all VOP_*() interfaces in regards to the special handling
  of the case where a locked vnode is passed, a locked vnode is
  returned, and the returned vnode happens to wind up being the same
  as the locked vnode (Allow a double-locked vnode on return and get
  rid of all the stupid code that juggles locks around to get around
  the non-recursive nature of current exclusive locks).

  VOP_LOOKUP is the most confused interface that needs cleaning up.

  With only a small amount of additional work, mainly KASSERTs to
  catch potential problems, we should be able to turn on exclusive 
  recursion.  The VOP_*() interfaces will have to be fixed one at
  a time with VOP_LOOKUP topping the list.

* Make exclusive buffer cache locks recursive.  Kirk has completed all
  the preliminary work on this and we should be able to just turn it
  on.  We just haven't gotten around to it (and the release got in the
  way).  This is necessary to support up and coming softupdates mechanisms
  (e.g. background fsck, snapshot dumps) as well as to better support device
  recursion.

* Cleanup the buffer cache API (bread(), BUF_STRATEGY(), and so forth).
  Specifically, split out the call functionality such that the buffer
  cache can determine whether a buffer being obtained is going to be
  used for reading or writing.  At the moment we don't know if the system
  is going to dirty a buffer until after the fact and this has caused a
  lot of pain in regards to dealing with low-memory situations.

  getblk() -> getblk_sh() and getblk_ex()

Obtain bp without issuing I/O, getting either a shared or exclusive
lock on the bp.  With a shared lock you are allowed to issue READ
I/O but you are not allowed to modify the contents of the buffer.
With an exclusive lock you are allowed to issue both READ and WRITE
I/O and you can modify the contents of the buffer.

  bread() -> bread_sh() and bread_ex()

Obtain and validate (issue read I/O as appropriate) a bp.  bread_sh()
allows a buffer to be accessed but not modified or rewritten.
bread_ex() allows a buffer to be modified and written.  (A hedged
prototype sketch for these calls appears just after this list.)

* Many uses of the buffer cache in the critical path do not actually 
  require the buffer data to be mapped into KVM.  For example, a number 
  of I/O devices need only the b_pages[] array and do not need a b_data
  mapping.  It would not take a huge amount of work to adjust the 
  uiomove*() interfaces appropriately.

  The general plan is to try to remove whole portions of the current buffer
  cache functionality and shift them into the new vm_pager_*() API.  That
  is, to operate on VM Object's directly whenever possible.

  The idea for the buffer cache is to shift its functionality to one that
  is solely used to issue device I/O and to keep track of dirty areas for
  proper sequencing of I/O (e.g. softupdate's use of the buffer cache 
  to placemark I/O will not change).  The core buffer cache code would
  no longer map things to KVM with b_data, that functionality would be
  shifted to the VM Object vm_pager_*() API.  The buffer cache would
  continue to use the b_pages[] array mechanism to collect pages for I/O,
  for clustering, and so forth.


  It should be noted that the buffer cache's perceived slowness is almost
  entirely due to all the KVM manipulation it does for b_data, and that
  such manipulation is not necessary for the vast majority of the critical
  path:  Reading and writing file data (can run through the VM Object
  API), and issuing I/O (can avoid b_data KVM mappings entirely).  

  Meta data, such as 

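Hedged prototypes for the shared/exclusive split sketched in the list
above (referenced from the bread() item).  The four names are Matt's; the
argument lists are simply modeled on the shapes of the existing getblk()
and bread() and are an assumption, not a committed interface.

#include <sys/types.h>

struct vnode;
struct buf;
struct ucred;

/* Obtain a buffer without I/O; a shared lock permits read I/O only. */
struct buf  *getblk_sh(struct vnode *vp, daddr_t blkno, int size,
                int slpflag, int slptimeo);
/* Obtain a buffer without I/O; an exclusive lock permits read and write
 * I/O and modification of the contents. */
struct buf  *getblk_ex(struct vnode *vp, daddr_t blkno, int size,
                int slpflag, int slptimeo);

/* Obtain and validate (issuing read I/O as needed); no modification. */
int          bread_sh(struct vnode *vp, daddr_t blkno, int size,
                struct ucred *cred, struct buf **bpp);
/* Obtain and validate; the caller may dirty and rewrite the buffer. */
int          bread_ex(struct vnode *vp, daddr_t blkno, int size,
                struct ucred *cred, struct buf **bpp);
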
Re: patches for test / review

2000-03-20 Thread Poul-Henning Kamp

In message [EMAIL PROTECTED], Matthew Dillon writes:

I think so.  I can give -current a quick synopsis of the plan but I've
probably forgotten some of the bits (note: the points below are not
in any particular order):

Thanks for the sketch.  It sounds really good.

Is it your intention that drivers which cannot work from the b_pages[]
array will call to map them into VM, or will a flag on the driver/dev_t/
whatever tell the generic code that it should be mapped before calling
the driver ?

What about unaligned raw transfers, say a raw CD read of 2352 bytes
from userland ?  I presume we will need an offset into the first 
page for that ?

One thing I would like to see is for the buffers to know how to
write themselves.  There is nothing which mandates that a buffer
be backed by a disk-like device, and there are uses for buffers
which aren't.

Being able to say bp->bop_write(bp) rather than bwrite(bp) would
allow that flexibility.  Kirk already introduced a bio_ops[] but
made it global for now; that should be per buffer and have all the
bufferops in it (except for the ones which instantiate the buffer).

If we had this, pseudo filesystems like DEVFS could use UFS for
much of their naming management.  This is currently impossible.
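
A hedged sketch of the per-buffer ops idea: a small vtable hung off struct
buf so that bwrite(bp) becomes a call through the buffer's own operations.
The structure and function names here mirror the description above but are
otherwise my own illustration, not Kirk's bio_ops or any posted patch.

struct buf;

struct buf_ops_sketch {
    int (*bop_write)(struct buf *bp);
    /* ... other per-buffer operations, minus the ones which
     *     instantiate the buffer ... */
};

/* What a caller would do instead of calling bwrite() directly. */
static int
buf_dispatch_write(struct buf *bp, struct buf_ops_sketch *ops)
{
    return (ops->bop_write(bp));
}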

--
Poul-Henning Kamp FreeBSD coreteam member
[EMAIL PROTECTED]   "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!





Re: patches for test / review

2000-03-20 Thread Matthew Dillon

:Thanks for the sketch.  It sounds really good.
:
:Is it your intention that drivers which cannot work from the b_pages[]
:array will call to map them into VM, or will a flag on the driver/dev_t/
:whatever tell the generic code that it should be mapped before calling
:the driver ?
:
:What about unaligned raw transfers, say a raw CD read of 2352 bytes
:from userland ?  I presume we will need an offset into the first 
:page for that ?

Well, let me tell you what the fuzzy goal is first and then maybe we
can work backwards.

Eventually all physical I/O needs a physical address.  The quickest
way to get to a physical address is to be given an array of vm_page_t's
(which can be trivially translated to physical addresses).

The buffer cache already has such an array, called b_pages[].

Any I/O that runs through b_data or runs through a uio must eventually
be cut up into blocks of contiguous physical addresses.

What we want to do is to try to extend VMIO (aka the vm_page_t) all
the way through the I/O system - both VFS and DEV I/O, in order to 
remove all the nasty back and forth translations.

In regards to raw devices I originally envisioned having two BUF_*()
strategy calls - one that uses a page array, and one that uses b_data.
But your idea below - using bio_ops[] - is much better.

In regards to odd block sizes and offsets the real question is whether
an attempt should be made to translate UIO ops into buffer cache b_pages[]
ops directly, maintaining offsets and odd sizes, or whether we should 
back-off to a copy scheme where we allocate b_pages[] for oddly sized 
uio's and then copy the data to the uio buffer.

My personal preference is to not pollute the VMIO page-passing mechanism
with all sorts of fields to handle weird offsets and sizes.  Instead we
ought to take the copy hit for the non-optimal cases, and simply fix all
the programs doing the accesses to pass optimally aligned buffers.  For
example, for a raw-I/O on an audio CD track you would pass a page-aligned
buffer with a request size of at least a page (e.g. 4K on IA32) in your
read(), and the raw device would return '2352' as the result and the
returned data would be page-aligned.

This would allow the system call to use the b_pages[] strategy entry
point even for devices with odd sizes and still get optimal (zero-copy)
operation.  If the user passes a buffer that is not page-aligned (or not
a multiple of the page size), the system takes the copy hit in order to
keep the lower-level I/O interface clean.
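
The userland side of that convention, as a sketch: pass a page-aligned
buffer of at least a page and let a short read (2352 bytes for one raw CD
audio frame) come back zero-copy.  The device path is a placeholder, not a
real device name.

#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    void *buf;
    ssize_t n;
    int fd;

    /* Page-aligned and at least one page, so the kernel can go zero-copy. */
    if (posix_memalign(&buf, pagesz, pagesz) != 0)
        return (1);

    fd = open("/dev/rawcd0", O_RDONLY);   /* placeholder device name */
    if (fd < 0)
        return (1);

    n = read(fd, buf, pagesz);            /* expect 2352 for one frame */
    printf("read returned %zd bytes\n", n);

    close(fd);
    free(buf);
    return (0);
}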

:One thing I would like to see is for the buffers to know how to
:write themselves.  There is nothing which mandates that a buffer
:be backed by a disk-like device, and there are uses for buffers
:which aren't.
:
:Being able to say bp->bop_write(bp) rather than bwrite(bp) would
:allow that flexibility.  Kirk already introduced a bio_ops[] but
:made it global for now; that should be per buffer and have all the
:bufferops in it (except for the ones which instantiate the buffer).
:
:If we had this, pseudo filesystems like DEVFS could use UFS for
:much of their naming management.  This is currently impossible.
:
:--
:Poul-Henning Kamp FreeBSD coreteam member
:[EMAIL PROTECTED]   "Real hackers run -current on their laptop."
:FreeBSD -- It will take a long time before progress goes too far!

I like the idea of dynamicizing bio_ops[] and using that to issue 
struct buf based I/O.  It fits very nicely into the general idea of
separating the VFS and DEV I/O interfaces (they are currently hopelessly
intertwined).

Actually, the more I think about it the more I'm willing to just say
to hell with it and start doing all the changes all at once, in parallel,
including the two patches you wanted reviewed earlier (though I would
request that you not combine disparate patch functionalities into a 
single patch set).  I agree with Julian on the point about IPSEC.

Dynamicizing bio_ops[] ought to be trivial.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: patches for test / review

2000-03-20 Thread Alfred Perlstein

* Matthew Dillon [EMAIL PROTECTED] [000320 10:01] wrote:
 
 :
 :
 :Kirk and I have already mapped out a plan to drastically update
 :the buffer cache API which will encapsulate much of the state within
 :the buffer cache module.
 :
 :Sounds good.  Combined with my stackable BIO plans that sounds like
 :a really great win for FreeBSD.
 :
 :--
 :Poul-Henning Kamp FreeBSD coreteam member
 :[EMAIL PROTECTED]   "Real hackers run -current on their laptop."
 
 I think so.  I can give -current a quick synopsis of the plan but I've
 probably forgotten some of the bits (note: the points below are not
 in any particular order):



 * Cleanup the buffer cache API (bread(), BUF_STRATEGY(), and so forth).
   Specifically, split out the call functionality such that the buffer
   cache can determine whether a buffer being obtained is going to be
   used for reading or writing.  At the moment we don't know if the system
   is going to dirty a buffer until after the fact and this has caused a
   lot of pain in regards to dealing with low-memory situations.
 
   getblk() -> getblk_sh() and getblk_ex()
 
   Obtain bp without issuing I/O, getting either a shared or exclusive
   lock on the bp.  With a shared lock you are allowed to issue READ
   I/O but you are not allowed to modify the contents of the buffer.
   With an exclusive lock you are allowed to issue both READ and WRITE
   I/O and you can modify the contents of the buffer.
 
   bread() -> bread_sh() and bread_ex()
 
   Obtain and validate (issue read I/O as appropriate) a bp.  bread_sh()
   allows a buffer to be accessed but not modified or rewritten.
   bread_ex() allows a buffer to be modified and written.

This seems to allow for expressing intent to write to buffers,
which would be an excellent place to cow the pages 'in software'
rather than obsd's way of using cow'd pages to accomplish the same
thing.

I'm not sure if you remember what I brought up at BAFUG, but I'd
like to see something along the lines of BX_BKGRDWRITE that Kirk
is using for the bitmap blocks in softupdates to be enabled on a
system wide basis.  That way rewriting data that has been sent to
the driver isn't blocked and at the same time we don't need to page
protect during every strategy call.

I may have misunderstood your intent, but using page protections
on each IO would seem to introduce a lot of performance issues that
the rest of these points are all trying to get rid of.

   The idea for the buffer cache is to shift its functionality to one that
   is solely used to issue device I/O and to keep track of dirty areas for
   proper sequencing of I/O (e.g. softupdate's use of the buffer cache 
   to placemark I/O will not change).  The core buffer cache code would
   no longer map things to KVM with b_data, that functionality would be
   shifted to the VM Object vm_pager_*() API.  The buffer cache would
   continue to use the b_pages[] array mechanism to collect pages for I/O,
   for clustering, and so forth.

Keeping the current cluster code is a bad idea: if the drivers were
taught how to traverse the linked list in the buf struct rather
than just notice "a big buffer" we could avoid a lot of page
twiddling and also allow for massive IO clustering (> 64k) because
we won't be limited by the size of the b_pages[] array for our
upper bound on the amount of buffers we can issue effectively a
scatter/gather on (since the drivers must VTOPHYS them anyway).

To realize my "nfs super commit" stuff all we'd need to do is make
the max cluster size something like 0-1 and instantly get an almost
unbounded IO burst.

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]





Re: patches for test / review

2000-03-20 Thread Poul-Henning Kamp

In message [EMAIL PROTECTED], Matthew Dillon writes:

Well, let me tell you what the fuzzy goal is first and then maybe we
can work backwards.

Eventually all physical I/O needs a physical address.  The quickest
way to get to a physical address is to be given an array of vm_page_t's
(which can be trivially translated to physical addresses).

Not all:
PIO access to ATA needs virtual access.
RAID5 needs virtual access to calculate parity.

What we want to do is to try to extend VMIO (aka the vm_page_t) all
the way through the I/O system - both VFS and DEV I/O, in order to 
remove all the nasty back and forth translations.

I agree, but some drivers need mapping; we need to cater for those.
They could simply call a vm_something(struct buf *) call which would
map the pages and things would "just work".

For RAID5 we have the opposite problem also:  data is created which
has only a mapped existence and the b_pages[] array is not populated.

In regards to odd block sizes and offsets the real question is whether
an attempt should be made to translate UIO ops into buffer cache b_pages[]
ops directly, maintaining offsets and odd sizes, or whether we should 
back-off to a copy scheme where we allocate b_pages[] for oddly sized 
uio's and then copy the data to the uio buffer.

I don't know of any non DEV_BSIZE aligned apps that are sufficiently 
high-profile and high-performance to justify too much code to avoid
a copy operation, so I guess that is OK.

My personal preference is to not pollute the VMIO page-passing mechanism
with all sorts of fields to handle weird offsets and sizes.  Instead we
ought to take the copy hit for the non-optimal cases, and simply fix all
the programs doing the accesses to pass optimally aligned buffers.  For
example, for a raw-I/O on an audio CD track you would pass a page-aligned
buffer with a request size of at least a page (e.g. 4K on IA32) in your
read(), and the raw device would return '2352' as the result and the
returned data would be page-aligned.

No protest from here.  Encouraging people to think about their data
and the handling of them will always have my vote :-)


--
Poul-Henning Kamp FreeBSD coreteam member
[EMAIL PROTECTED]   "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!





Re: patches for test / review

2000-03-20 Thread Poul-Henning Kamp

In message [EMAIL PROTECTED], Alfred Perlstein writes:

Keeping the current cluster code is a bad idea: if the drivers were
taught how to traverse the linked list in the buf struct rather
than just notice "a big buffer" we could avoid a lot of page
twiddling and also allow for massive IO clustering (> 64k) 

Before we redesign the clustering, I would like to know if we
actually have any recent benchmarks which prove that clustering
is overall beneficial ?

I would think that track-caches and intelligent drives would gain
much if not more of what clustering was designed to gain.

I seem to remember Bruce saying that clustering could even hurt ?

--
Poul-Henning Kamp FreeBSD coreteam member
[EMAIL PROTECTED]   "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!





Re: patches for test / review

2000-03-20 Thread Alfred Perlstein

* Poul-Henning Kamp [EMAIL PROTECTED] [000320 11:45] wrote:
 In message [EMAIL PROTECTED], Alfred Perlstein writes:
 
 Keeping the current cluster code is a bad idea: if the drivers were
 taught how to traverse the linked list in the buf struct rather
 than just notice "a big buffer" we could avoid a lot of page
 twiddling and also allow for massive IO clustering (> 64k) 
 
 Before we redesign the clustering, I would like to know if we
 actually have any recent benchmarks which prove that clustering
 is overall beneficial ?

Yes it is really beneficial.

I'm not talking about a redesign of the clustering code as much as
making the drivers that take a callback from it actually traverse
the 'union cluster_info' rather than relying on the system to fake
the pages being contiguous via remapping.

There's nothing wrong with the clustering algorithms, it's just the
steps it has to take to work with the drivers.

 
 I would think that track-caches and intelligent drives would gain
 much if not more of what clustering was designed to gain.
 
 I seem to remember Bruce saying that clustering could even hurt ?

Yes because of the gyrations it needs to go through to maintain backward
compatibility for devices that want to see "one big buffer" rather than
simply follow a linked list of io operations.

Not true, at least for 'devices' like NFS where large IO ops issued
save milliseconds in overhead.  Unless each device was to re-buffer
IO (which is silly) or scan the vp passed to it (violating the
abstraction and being really scary like my flopped super-commit
stuff for NFS) it would make NFS performance even worse for doing
commits.

Without clustering you'd have to issue a commit RPC for each 8k block.
With the current clustering you have to issue a commit for each 64k block.
With an unbounded linked list, well, there is only the limit that the
filesystem asks for.
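
The commit-RPC arithmetic above, made explicit (my own illustration, not
NFS code): the number of commit round trips is just the dirty range
divided by the largest I/O the clustering layer will hand to the NFS code
in one piece.

#include <stdio.h>

static long
commits_needed(long dirty_bytes, long max_cluster)
{
    return ((dirty_bytes + max_cluster - 1) / max_cluster);
}

int
main(void)
{
    long dirty = 1024 * 1024;   /* 1MB of dirty file data */

    printf("no clustering (8k): %ld commits\n",
        commits_needed(dirty, 8 * 1024));     /* 128 */
    printf("64k clustering:     %ld commits\n",
        commits_needed(dirty, 64 * 1024));    /* 16 */
    printf("unbounded list:     %ld commit\n",
        commits_needed(dirty, dirty));        /* 1 */
    return (0);
}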

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]





I/O clustering, Re: patches for test / review

2000-03-20 Thread Alfred Perlstein

* Poul-Henning Kamp [EMAIL PROTECTED] [000320 12:03] wrote:
 In message [EMAIL PROTECTED], Alfred Perlstein writes:
 * Poul-Henning Kamp [EMAIL PROTECTED] [000320 11:45] wrote:
  In message [EMAIL PROTECTED], Alfred Perlstein writes:
  
  Keeping the current cluster code is a bad idea: if the drivers were
  taught how to traverse the linked list in the buf struct rather
  than just notice "a big buffer" we could avoid a lot of page
  twiddling and also allow for massive IO clustering (> 64k) 
  
  Before we redesign the clustering, I would like to know if we
  actually have any recent benchmarks which prove that clustering
  is overall beneficial ?
 
 Yes it is really beneficial.
 
 I would like to see some numbers if you have them.

No I don't have numbers.

Committing a 64k block would require 8 times the overhead of bundling
up the RPC as well as transmission and reply, it may be possible
to pipeline these commits because you don't really need to wait
for one to complete before issuing another request, but it's still
8x the amount of traffic.

You also complicate and penalize drivers because not all drivers
can add an IO request to an already started transaction; those
devices will need to start new transactions for each buffer instead
of bundling up the list and passing it all along.

Maybe I'm missing something.

Is there something to provide a clean way to cluster IO?  Can you
suggest something that won't have this sort of impact on NFS (and
elsewhere) if the clustering code was removed?

Bruce, what part of the clustering code makes you think of it as
hurting us?  I thought it was the mapping code.

 --
 Poul-Henning Kamp FreeBSD coreteam member
 [EMAIL PROTECTED]   "Real hackers run -current on their laptop."
 FreeBSD -- It will take a long time before progress goes too far!

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]





Re: I/O clustering, Re: patches for test / review

2000-03-20 Thread Poul-Henning Kamp

In message [EMAIL PROTECTED], Alfred Perlstein writes:

  Before we redesign the clustering, I would like to know if we
  actually have any recent benchmarks which prove that clustering
  is overall beneficial ?
 
 Yes it is really beneficial.
 
 I would like to see some numbers if you have them.

No I don't have numbers.

Committing a 64k block would require 8 times the overhead of bundling
up the RPC as well as transmission and reply, it may be possible
to pipeline these commits because you don't really need to wait
 for one to complete before issuing another request, but it's still
8x the amount of traffic.

I agree that it is obvious for NFS, but I don't see it as being
obvious at all for (modern) disks, so for that case I would like
to see numbers.

If running without clustering is just as fast for modern disks,
I think the clustering needs to be rethought.

--
Poul-Henning Kamp FreeBSD coreteam member
[EMAIL PROTECTED]   "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!





Re: I/O clustering, Re: patches for test / review

2000-03-20 Thread Matthew Dillon


:
:* Poul-Henning Kamp [EMAIL PROTECTED] [000320 12:03] wrote:
: In message [EMAIL PROTECTED], Alfred Perlstein writes:
: * Poul-Henning Kamp [EMAIL PROTECTED] [000320 11:45] wrote:
:  In message [EMAIL PROTECTED], Alfred Perlstein writes:
:  
:  Keeping the current cluster code is a bad idea: if the drivers were
:  taught how to traverse the linked list in the buf struct rather
:  than just notice "a big buffer" we could avoid a lot of page
:  twiddling and also allow for massive IO clustering (> 64k) 
:  
:  Before we redesign the clustering, I would like to know if we
:  actually have any recent benchmarks which prove that clustering
:  is overall beneficial ?
: 
: Yes it is really beneficial.
: 
: I would like to see some numbers if you have them.
:
:No I don't have numbers.
:
:Committing a 64k block would require 8 times the overhead of bundling
:up the RPC as well as transmission and reply, it may be possible
:to pipeline these commits because you don't really need to wait

Clustering is extremely beneficial.  DG and I and I think even BDE and
Tor have done a lot of random tests in that area.  I did a huge amount
of clustering related work while optimizing NFSv3 and fixing up the
random/sequential I/O heuristics for 4.0 (for both NFS and UFS).

The current clustering code does a pretty good job and I would hesitate
to change it at this time.  The only real overhead comes from the KVA
pte mappings for b_data in the pbuf that the clustering (and other)
code uses.  I do not think that redoing the clustering will have 
a beneficial result until *after* we optimize the I/O path as per
my previous posting.

Once we optimize the I/O path to make it more VM Object centric, it
will make it a whole lot easier to remove *ALL* the artificial I/O size
limitations.

-Matt






Re: patches for test / review

2000-03-20 Thread Dan Nelson

In the last episode (Mar 20), Poul-Henning Kamp said:
 In message [EMAIL PROTECTED], Alfred Perlstein writes:
 * Poul-Henning Kamp [EMAIL PROTECTED] [000320 11:45] wrote:
  
  Before we redesign the clustering, I would like to know if we
  actually have any recent benchmarks which prove that clustering is
  overall beneficial ?
 
 Yes it is really beneficial.
 
 I would like to see some numbers if you have them.

For hardware RAID arrays that support it, if you can get the system to
issue writes that are larger than the entire RAID-5 stripe size, your
immensely slow "read parity/recalc parity/write parity/write data"
operations turn into "recalc parity for entire stripe/write entire
stripe".  RAID-5 magically achieves RAID-0 write speeds!  Given 32k
granularity, and 8 disks per RAID group, you'll need a write
size of 32*7 = 224k.  Given 64K granularity and 27 disks, that's 1.6MB.

I have seen the jump in write throughput as I tuned an Oracle
database's parameters on both Solaris and DEC Unix boxes.  Get Oracle
to write blocks larger than a RAID-5 stripe, and it flies.

-- 
Dan Nelson
[EMAIL PROTECTED]





Re: I/O clustering, Re: patches for test / review

2000-03-20 Thread Matthew Dillon

:
:I agree that it is obvious for NFS, but I don't see it as being
:obvious at all for (modern) disks, so for that case I would like
:to see numbers.
:
:If running without clustering is just as fast for modern disks,
:I think the clustering needs rethought.
:
:   Depends on the type of disk drive and how it is configured. Some drives
:perform badly (skip a revolution) with back-to-back writes. In all cases,
:without aggregation of blocks, you pay the extra cost of additional interrupts
:and I/O rundowns, which can be a significant factor. Also, unless the blocks
:were originally written by the application in a chunk, they will likely be
:mixed with blocks to varying locations, in which case for drives without
:write caching enabled, you'll have additional seeks to write the blocks out.
:Things like this don't show up when doing simplistic sequential write tests.
:
:-DG
:
:David Greenman
:Co-founder/Principal Architect, The FreeBSD Project - http://www.freebsd.org

   I have an excellent example of this related to NFS.  It's still applicable
   even though the NFS point has already been conceded.

   As part of the performance enhancements package I extended the sequential
   detection heuristic to the NFS server side code and turned on clustering.
   On the server, mind you, not the client.

   Read performance went up drastically.  My 100BaseTX network instantly
   maxed out and, more importantly, the server side cpu use went down
   drastically.  Here is the relevant email from my archives describing the
   performance gains:

:From:   dillon
:To:   Alfred Perlstein [EMAIL PROTECTED]
:Cc:   Alan Cox [EMAIL PROTECTED], Julian Elischer [EMAIL PROTECTED]
:Date: Sun Dec 12 10:11:06 1999
:
:...
:This proposed patch allows us to maintain a sequential read heuristic
:on the server side.  I noticed that the NFS server side reads only 8K
:blocks from the physical media even when the NFS client is reading a
:file sequentially.
:
:With this heuristic in place I can now get 9.5 to 10 MBytes/sec reading
:over NFS on a 100BaseTX network, and the server winds up being 80% 
:idle.  Under -stable the same test runs 72% idle and 8.4 MBytes/sec.

This is in spite of the fact that in this sequential test the hard
drives were caching the read data ahead anyway.  The reduction in
command/response/interrupt overhead on the server by going from 8K read
I/O's to 64K read I/O's in the sequential case made an obvious beneficial
impact on the cpu.  I almost halved the cpu overhead on the server!

So while on-disk caching makes a lot of sense, it is in no way able
to replace software clustering.  Having both working together is a
killer combination.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]






Re: patches for test / review

2000-03-20 Thread Matthew Dillon


:
:In message [EMAIL PROTECTED], Matthew Dillon writes:
:
:Well, let me tell you what the fuzzy goal is first and then maybe we
:can work backwards.
:
:Eventually all physical I/O needs a physical address.  The quickest
:way to get to a physical address is to be given an array of vm_page_t's
:(which can be trivially translated to physical addresses).
:
:Not all:
:PIO access to ATA needs virtual access.
:RAID5 needs virtual access to calculate parity.

... which means that the initial implementation for PIO and RAID5
utilizes the mapped-buffer bioops interface rather than the b_pages[]
bioops interface.

But here's the point:  We need to require that all entries *INTO* the
bio system start with at least b_pages[] and then generate b_data only
when necessary.  If a particular device needs a b_data mapping, it
can get one, but I think it would be a huge mistake to allow entry into
the device subsystem to utilize *either* a b_data mapping *or* a 
b_pages[] mapping.  Big mistake.  There has to be a lowest common
denominator that the entire system can count on and it pretty much has
to be an array of vm_page_t's.

If a particular subsystem needs b_data, then that subsystem is obviously
willing to take the virtual mapping / unmapping hit.  If you look at 
Greg's current code this is, in fact, what is occurring: the critical
path through the buffer cache in a heavily loaded system tends to require
a KVA mapping *AND* a KVA unmapping on every buffer access (just that the
unmappings tend to be for unrelated buffers).  The reason this occurs
is because even with the larger amount of KVA we made available to the
buffer cache in 4.x, there still isn't enough to leave mappings intact
for long periods of time.  A 'systat -vm 1' will show you precisely
what I mean (also sysctl -a | fgrep bufspace).  

So we will at least not be any worse off than we are now, and probably
better off since many of the buffers in the new system will not have
to be mapped.  For example, when vinum's RAID5 breaks up a request
and issues a driveio() it passes a buffer which is assigned to b_data
which must be translated (through page table lookups) to physical
addresses anyway, so the fact that vinum does not populate 
b_pages[] does *NOT* help it in the least.  It actually makes the job
harder.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]

:--
:Poul-Henning Kamp FreeBSD coreteam member
:[EMAIL PROTECTED]   "Real hackers run -current on their laptop."
:FreeBSD -- It will take a long time before progress goes too far!
:






Re: patches for test / review

2000-03-20 Thread Alfred Perlstein

* Matthew Dillon [EMAIL PROTECTED] [000320 14:18] wrote:
 
 :lock on the bp.  With a shared lock you are allowed to issue READ
 :I/O but you are not allowed to modify the contents of the buffer.
 :With an exclusive lock you are allowed to issue both READ and WRITE
 :I/O and you can modify the contents of the buffer.
 : 
 :   bread() -> bread_sh() and bread_ex()
 : 
 :Obtain and validate (issue read I/O as appropriate) a bp.  bread_sh()
 :allows a buffer to be accessed but not modified or rewritten.
 :bread_ex() allows a buffer to be modified and written.
 :
 :This seems to allow for expressing intent to write to buffers,
 :which would be an excellent place to cow the pages 'in software'
 :rather than obsd's way of using cow'd pages to accomplish the same
 :thing.
 
 Yes, absolutely.  DG (if I remember right) is rabid about not taking
 VM faults while sitting in the kernel and I tend to agree with him that
 it's a cop-out to use VM faults in the kernel to get around those
 sorts of problems.

ok, so we're on the same page then. :)

 
 :I'm not sure if you remember what I brought up at BAFUG, but I'd
 :like to see something along the lines of BX_BKGRDWRITE that Kirk
 :is using for the bitmap blocks in softupdates to be enabled on a
 :system wide basis.  That way rewriting data that has been sent to
 :the driver isn't blocked and at the same time we don't need to page
 :protect during every strategy call.
 :
 :I may have misunderstood your intent, but using page protections
 :on each IO would seem to introduce a lot of performance issues that
 :the rest of these points are all trying to get rid of.
 
 At the low-level device there is no concept of page protections.
 If you pass an array of vm_page_t's then that is where the data
 will be taken from or written to.
 
 A background-write capability is actually much more easily implemented
 at the VM Object level than the buffer cache level.  If you think about
 it, all you need to do is add another VM Object layer *below* the 
 one representing the device.  Whenever a device write is initiated the
 pages are moved to the underlying layer.  If a process (or the kernel)
 needs to modify the pages while the write is in progress, a copy-on-write
 occurs through normal mechanisms.  On completion of the I/O the pages
 are moved back to the main VM Object device layer except for those that
 would conflict with any copy-on-write that occured (the original device
 pages in the conflict case simply get thrown away).  
 
 Problem solved.  Plus this deals with low-memory situations properly...
 we do not introduce any new deadlocks.

That does sound a lot better, using the buffer system for anything more
than describing an IO is a hack and I'd like to see an implementation such
as this be possible.

 
 :   The idea for the buffer cache is to shift its functionality to one that
 :   is solely used to issue device I/O and to keep track of dirty areas for
 :   proper sequencing of I/O (e.g. softupdate's use of the buffer cache 
 :   to placemark I/O will not change).  The core buffer cache code would
 :...
 :
 :Keeping the current cluster code is a bad idea, if the drivers were
 :taught how to traverse the linked list in the buf struct rather
 :than just notice "a big buffer" we could avoid a lot of page
 :twiddling and also allow for massive IO clustering (> 64k) because
 :we won't be limited by the size of the b_pages[] array for our
 :upper bound on the amount of buffers we can issue effectively a
 :scatter/gather on (since the drivers must VTOPHYS them anyway).
 
 This devolves down into how simple (or complex) an interface we
 are willing to use to talk to the low-level device.
 
 The reason I would hesitate to move to a 'linked list of buffers'
 methodology is that *ALL* of the current VM API's pass a single
 array of vm_page_t's... not just the current struct buf code, but also
 the VOP_PUTPAGES and VOP_GETPAGES API.
 
 I would much prefer to keep this simplicity intact in order to avoid
 introducing even more bugs into the source than we will when we try
 to do this stuff, which means changing the clustering code from:
 
   * copies vm_page_t's into the cluster pbuf's b_pages[] array
   * maps the pages into b_data
 
 to:
 
 
   * copies vm_page_t's into the cluster pbuf's b_pages[] array
 
 In other words, keeping the clustering changes as simple as possible.
 I think once the new I/O path is operational we can then start thinking
 about how to optimize it -- for example, by having a default (embedded)
 static array but also allowing the b_pages array to be dynamically
 allocated.

Why?  Why allocate a special pbuf just for all of this?  Problems can
develop where you may have implemented this, and now clustering can grow
without (nearly) any bounds, however now you

Re: patches for test / review

2000-03-20 Thread Matthew Dillon


:  lock on the bp.  With a shared lock you are allowed to issue READ
:  I/O but you are not allowed to modify the contents of the buffer.
:  With an exclusive lock you are allowed to issue both READ and WRITE
:  I/O and you can modify the contents of the buffer.
: 
:   bread()  -> bread_sh() and bread_ex()
: 
:  Obtain and validate (issue read I/O as appropriate) a bp.  bread_sh()
:  allows a buffer to be accessed but not modified or rewritten.
:  bread_ex() allows a buffer to be modified and written.
:
:This seems to allow for expressing intent to write to buffers,
:which would be an excellent place to cow the pages 'in software'
:rather than obsd's way of using cow'd pages to accomplish the same
:thing.

Yes, absolutely.  DG (if I remember right) is rabid about not taking
VM faults while sitting in the kernel and I tend to agree with him that
it's a cop-out to use VM faults in the kernel to get around those
sorts of problems.

:I'm not sure if you remember what I brought up at BAFUG, but I'd
:like to see something along the lines of BX_BKGRDWRITE that Kirk
:is using for the bitmaps blocks in softupdates to be enabled on a
:system wide basis.  That way rewriting data that has been sent to
:the driver isn't blocked and at the same time we don't need to page
:protect during every strategy call.
:
:I may have misunderstood your intent, but using page protections
:on each IO would seem to introduce a lot of performance issues that
:the rest of these points are all trying to get rid of.

At the low-level device there is no concept of page protections.
If you pass an array of vm_page_t's then that is where the data
will be taken from or written to.

A background-write capability is actually much more easily implemented
at the VM Object level than the buffer cache level.  If you think about
it, all you need to do is add another VM Object layer *below* the 
one representing the device.  Whenever a device write is initiated the
pages are moved to the underlying layer.  If a process (or the kernel)
needs to modify the pages while the write is in progress, a copy-on-write
occurs through normal mechanisms.  On completion of the I/O the pages
are moved back to the main VM Object device layer except for those that
would conflict with any copy-on-write that occurred (the original device
pages in the conflict case simply get thrown away).  

Problem solved.  Plus this deals with low-memory situations properly...
we do not introduce any new deadlocks.
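
As a rough illustration of the three transitions described above (write
starts, page touched mid-write, write completes), here is a toy model.
All types and names are invented for the sketch; this is nothing like
real vm_object code:

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct toy_page {
        char data[PAGE_SIZE];
};

struct toy_object {
        struct toy_page *front;     /* page visible to readers/writers     */
        struct toy_page *inflight;  /* copy owned by the in-progress write */
};

/* Device write starts: hand the page to the lower (in-flight) layer. */
static void
write_start(struct toy_object *obj)
{
        obj->inflight = obj->front;
        obj->front = NULL;
}

/* The page is modified while the write is in flight: copy-on-write. */
static struct toy_page *
touch_during_write(struct toy_object *obj)
{
        if (obj->front == NULL && obj->inflight != NULL) {
                obj->front = malloc(sizeof(*obj->front));
                if (obj->front != NULL)
                        memcpy(obj->front, obj->inflight, PAGE_SIZE);
        }
        return (obj->front);
}

/* Write completes: move the page back unless a conflicting COW occurred. */
static void
write_done(struct toy_object *obj)
{
        if (obj->front == NULL)
                obj->front = obj->inflight;   /* no conflict: move it back */
        else
                free(obj->inflight);          /* COWed meanwhile: discard  */
        obj->inflight = NULL;
}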

:   The idea for the buffer cache is to shift its functionality to one that
:   is solely used to issue device I/O and to keep track of dirty areas for
:   proper sequencing of I/O (e.g. softupdate's use of the buffer cache 
:   to placemark I/O will not change).  The core buffer cache code would
:...
:
:Keeping the current cluster code is a bad idea, if the drivers were
:taught how to traverse the linked list in the buf struct rather
:than just notice "a big buffer" we could avoid a lot of page
:twiddling and also allow for massive IO clustering (> 64k) because
:we won't be limited by the size of the b_pages[] array for our
:upper bound on the amount of buffers we can issue effectively a
:scatter/gather on (since the drivers must VTOPHYS them anyway).

This devolves down into how simple (or complex) an interface we
are willing to use to talk to the low-level device.

The reason I would hesitate to move to a 'linked list of buffers'
methodology is that *ALL* of the current VM API's pass a single
array of vm_page_t's... not just the current struct buf code, but also
the VOP_PUTPAGES and VOP_GETPAGES API.

I would much prefer to keep this simplicity intact in order to avoid
introducing even more bugs into the source than we will when we try
to do this stuff, which means changing the clustering code from:

* copies vm_page_t's into the cluster pbuf's b_pages[] array
* maps the pages into b_data

to:


* copies vm_page_t's into the cluster pbuf's b_pages[] array

In other words, keeping the clustering changes as simple as possible.
I think once the new I/O path is operational we can then start thinking
about how to optimize it -- for example, by having a default (embedded)
static array but also allowing the b_pages array to be dynamically
allocated.
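
A minimal sketch of that embedded-plus-dynamic idea, with made-up field
names rather than the real struct buf:

#include <stdlib.h>

#define BUF_EMBEDDED_PAGES 32            /* arbitrary figure for the sketch */

typedef struct vm_page *vm_page_t;       /* opaque here */

struct toy_buf {
        int        b_npages;
        vm_page_t *b_pages;              /* points at embedded[] or malloc'd */
        vm_page_t  b_pages_embedded[BUF_EMBEDDED_PAGES];
};

/* Size the page array, spilling to a heap allocation only for big I/Os. */
static int
buf_set_npages(struct toy_buf *bp, int npages)
{
        if (npages <= BUF_EMBEDDED_PAGES) {
                bp->b_pages = bp->b_pages_embedded;
        } else {
                bp->b_pages = malloc((size_t)npages * sizeof(vm_page_t));
                if (bp->b_pages == NULL)
                        return (-1);
        }
        bp->b_npages = npages;
        return (0);
}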

:To realize my "nfs super commit" stuff all we'd need to do is make
:the max cluster size something like 0-1 and instantly get an almost
:unbounded IO burst.
:
:-- 
:-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
:

-Matt
Matthew Dillon 
[EMAIL PROTECTED]


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: patches for test / review

2000-03-20 Thread Paul Richards

Alfred Perlstein wrote:
 
 * Poul-Henning Kamp [EMAIL PROTECTED] [000320 11:45] wrote:
  In message [EMAIL PROTECTED], Alfred Perlstein writes:
 
  Keeping the current cluster code is a bad idea, if the drivers were
  taught how to traverse the linked list in the buf struct rather
  than just notice "a big buffer" we could avoid a lot of page
  twiddling and also allow for massive IO clustering (> 64k)
 
  Before we redesign the clustering, I would like to know if we
  actually have any recent benchmarks which prove that clustering
  is overall beneficial ?
 
 Yes it is really beneficial.

Yes, I've seen stats that show the degradation when clustering is
switched off.
Richard Wendlake (who wrote the OS detection code for the Netcraft web
server survey) did a lot of testing in this area because of some
pathological behavior he was seeing using Gnu's dbm package. 

Richard, do you want to post a summary of your tests?

 
 I'm not talking about a redesign of the clustering code as much as
 making the drivers that take a callback from it actually traverse
 the 'union cluster_info' rather than relying on the system to fake
 the pages being contiguous via remapping.
 
 There's nothing wrong with the clustering algorithms, it's just the
 steps it has to take to work with the drivers.

Well, there is something wrong with our clustering algorithm. It always
starts a new cluster when the first block of a file is written to. I
found this when trying to explain some of the pathological behavior that
Richard was seeing.

Imagine an algorithm that will write blocks 0,5,2,7,4,1,6,3,0,...

The clustering algorithm starts a new cluster if the block is at the
beginning of the file, so writing block 0 will always start a new
cluster.  When block 5 is written out, the clustering code will try to
add it to the existing cluster, fail, and so flush the existing cluster
(which only has block 0 in it) and then start another cluster with
block 5 in it.  This continues, with the previous cluster being flushed
and a new cluster being created with the current block in it.
Eventually we get to the point where 7 blocks have been flushed and the
current cluster contains block 3.  When it comes time to write out the
next block 0, the clustering algorithm doesn't bother trying to add the
block to the existing cluster but immediately starts a new one, so the
cluster with block 3 in it *never gets flushed*.
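
Here is a toy simulation of that write order (0,5,2,7,4,1,6,3,0,...).
It illustrates the symptom only, it is not the real cluster_write()
code, and the rules are simplified to exactly the two cases described
above:

#include <stdio.h>

int
main(void)
{
        int order[] = { 0, 5, 2, 7, 4, 1, 6, 3, 0 };
        int cstart = -1, cend = -1;
        size_t i;

        for (i = 0; i < sizeof(order) / sizeof(order[0]); i++) {
                int blk = order[i];

                if (blk == 0) {
                        /* Start of file: new cluster, old one never flushed. */
                        if (cstart >= 0)
                                printf("block 0: cluster %d..%d ABANDONED\n",
                                    cstart, cend);
                        cstart = cend = 0;
                } else if (blk == cend + 1) {
                        cend = blk;                     /* extends cluster  */
                } else {
                        printf("flush cluster %d..%d, new cluster at %d\n",
                            cstart, cend, blk);
                        cstart = cend = blk;
                }
        }
        return (0);
}

Running it, the cluster holding block 3 is reported as abandoned when
the second block 0 arrives, which is the behaviour seen in practice.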


Paul.


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: I/O clustering, Re: patches for test / review

2000-03-20 Thread Mike Smith


Just as a perhaps interesting aside on this topic; it'd be quite 
neat for controllers that understand scatter/gather to be able to 
simply suck N regions of buffer cache which were due for committing 
directly into an S/G list...

(wishlist item, I guess 8)
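
Something like this, perhaps, where the dirty regions already carry
their physical addresses; the structures and names below are made up
for the sketch, not any real controller interface:

#include <stdint.h>
#include <stddef.h>

struct sg_entry {
        uint64_t sg_paddr;      /* physical address of the segment */
        uint32_t sg_len;        /* length in bytes                 */
};

struct dirty_region {
        uint64_t dr_paddr;      /* physical address already known  */
        uint32_t dr_len;
};

/* Pack up to 'max' dirty regions straight into the controller's S/G list. */
static size_t
regions_to_sg(const struct dirty_region *dr, size_t nregions,
    struct sg_entry *sg, size_t max)
{
        size_t i, n = 0;

        for (i = 0; i < nregions && n < max; i++) {
                sg[n].sg_paddr = dr[i].dr_paddr;
                sg[n].sg_len = dr[i].dr_len;
                n++;
        }
        return (n);
}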

-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime. \\  [EMAIL PROTECTED]




To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: I/O clustering, Re: patches for test / review

2000-03-20 Thread David Greenman

Committing a 64k block would require 8 times the overhead of bundling
up the RPC as well as transmission and reply.  It may be possible
to pipeline these commits because you don't really need to wait
for one to complete before issuing another request, but it's still
8x the amount of traffic.

I agree that it is obvious for NFS, but I don't see it as being
obvious at all for (modern) disks, so for that case I would like
to see numbers.

If running without clustering is just as fast for modern disks,
I think the clustering needs to be rethought.

   Depends on the type of disk drive and how it is configured. Some drives
perform badly (skip a revolution) with back-to-back writes. In all cases,
without aggregation of blocks, you pay the extra cost of additional interrupts
and I/O rundowns, which can be a significant factor. Also, unless the blocks
were originally written by the application in a chunk, they will likely be
mixed with blocks to varying locations, in which case for drives without
write caching enabled, you'll have additional seeks to write the blocks out.
Things like this don't show up when doing simplistic sequential write tests.

-DG

David Greenman
Co-founder/Principal Architect, The FreeBSD Project - http://www.freebsd.org
Creator of high-performance Internet servers - http://www.terasolutions.com
Pave the road of life with opportunities.


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: patches for test / review

2000-03-20 Thread Poul-Henning Kamp

In message [EMAIL PROTECTED], Alfred Perlstein writes:
* Poul-Henning Kamp [EMAIL PROTECTED] [000320 11:45] wrote:
 In message [EMAIL PROTECTED], Alfred Perlstein writes:
 
 Keeping the current cluster code is a bad idea, if the drivers were
 taught how to traverse the linked list in the buf struct rather
 than just notice "a big buffer" we could avoid a lot of page
 twiddling and also allow for massive IO clustering (> 64k)
 
 Before we redesign the clustering, I would like to know if we
 actually have any recent benchmarks which prove that clustering
 is overall beneficial ?

Yes it is really beneficial.

I'm not talking about a redesign of the clustering code as much as
making the drivers that take a callback from it actually traverse
the 'union cluster_info' rather than relying on the system to fake
the pages being contiguous via remapping.

There's nothing wrong with the clustering algorithms, it's just the
steps it has to take to work with the drivers.

Hmm, try to keep vinum/RAID5 in the picture when you look at this code;
it complicates matters a lot.

--
Poul-Henning Kamp FreeBSD coreteam member
[EMAIL PROTECTED]   "Real hackers run -current on their laptop."
FreeBSD -- It will take a long time before progress goes too far!


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: I/O clustering, Re: patches for test / review

2000-03-20 Thread Mike Smith

 I agree that it is obvious for NFS, but I don't see it as being
 obvious at all for (modern) disks, so for that case I would like
 to see numbers.
 
 If running without clustering is just as fast for modern disks,
 I think the clustering needs to be rethought.

I think it should be pretty obvious, actually.  Command overhead is large 
(and not getting much smaller), and clustering primarily serves to reduce 
the number of commands and thus the ratio of command time vs. data time.

So unless the clustering implementation is extremely poor, it's 
worthwhile.
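
A back-of-the-envelope illustration (the numbers are invented for the
example): with roughly 0.2 ms of per-command overhead and about 25 MB/s
off the media, eight separate 8 KB writes cost about
8 x (0.2 + 0.3) = 4 ms, of which 1.6 ms is pure command overhead, while
one clustered 64 KB write costs about 0.2 + 2.5 = 2.7 ms, with command
overhead well under 10% of the total.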
-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime. \\  [EMAIL PROTECTED]




To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: patches for test / review

2000-03-20 Thread Wilko Bulte

On Mon, Mar 20, 2000 at 08:21:52PM +0100, Poul-Henning Kamp wrote:
 In message [EMAIL PROTECTED], Alfred Perlstein writes:
 
 Keeping the current cluster code is a bad idea, if the drivers were
 taught how to traverse the linked list in the buf struct rather
 than just notice "a big buffer" we could avoid a lot of page
 twiddling and also allow for massive IO clustering (> 64k)
 
 Before we redesign the clustering, I would like to know if we
 actually have any recent benchmarks which prove that clustering
 is overall beneficial ?
 
 I would think that track-caches and intelligent drives would gain
 much if not more of what clustering was designed to gain.

Hm. But I'd think that even with modern drives a smaller number of bigger
I/Os is preferable to lots of very small I/Os. Or have I missed the point?

-- 
Wilko Bulte Arnhem, The Netherlands   
http://www.tcja.nl  The FreeBSD Project: http://www.freebsd.org


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: patches for test / review

2000-03-20 Thread Matthew Jacob

 
 Hm. But I'd think that even with modern drives a smaller number of bigger
 I/Os is preferable to lots of very small I/Os.

Not necessarily. It depends upon overhead costs per-i/o. With larger I/Os, you
do pay in interference costs (you can't transfer data for request N because
the 256Kbytes of request M is still in the pipe).





To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message