Re: [zfs-discuss] Self-tuning recordsize

2006-10-17 Thread Erblichs
Group, et al, 

I don't understand: if the problem is systemic, driven by
the number of continually dirty pages and the stress of
cleaning those pages, then why attack it per filesystem?

The problem is FS independent, because any number of
different installed FSs can equally consume pages.
Thus, solving the problem on a per-FS basis seems to me a
band-aid approach.

Why doesn't the OS determine that a dangerously high watermark
of pages is continually being paged out (we have swapped, and
based on recent history a large percentage of available pages
is always dirty) and thus,

 * force the writes to a set of predetermined pages (limit the
   number of pages for I/O),
 * these pages get I/O scheduled immediately, not waiting until
   the pages are needed and found dirty
   (hopefully a percentage of these pages will be cleaned and
   immediately available if needed in the near future),

 Yes, the OS could redirect the I/O to be direct, without
 using the page cache, but the assumption is that these
 procs are behaving as multiple readers and will need the
 cached page data in the near future. Thus, refusing to
 cache the pages just because they CAN totally consume the
 cache removes the multiple readers' reason to cache the
 data in the first place, thus...


*  guarantee that heartbeats are always regular by preserving
   5 to 20% of pages for exec / text,
*  limit the number of interrupts being generated by the network
   so that low-level SCSI interrupts can page and are not starved
   (something the white paper did not mention),
   (yes, this will cause the loss of UDP-based data, but we
   need to generate some form of backpressure / explicit
   congestion event),
* if the files coming in from network were TCP based, hopefully
  a segment would be dropped and act as a backpressure to
  the originator of the data,
* if the files are being read from the FS, then a max I/O rate 
  should be determined based on the number of pages that are 
  clean and ready to accept FS data,
*  etc

Thus, tuning to determine whether the page cache should be used
for write or read should allow one set of processes not to
adversely affect the operation of other processes.

And any OS should slow down the dirty-page I/O only for
those specific processes, leaving the work of other processes
unaware of the I/O issues.

Mitchell Erblich
-

Richard Elling - PAE wrote:
 
 Roch wrote:
  Oracle will typically create its files with 128K writes,
  not recordsize-sized ones.
 
 Blast from the past...
 http://www.sun.com/blueprints/0400/ram-vxfs.pdf
 
   -- richard


Re: [zfs-discuss] Self-tuning recordsize

2006-10-17 Thread Jeremy Teo

Heya Roch,

On 10/17/06, Roch [EMAIL PROTECTED] wrote:
-snip-

Oracle will typically create its files with 128K writes,
not recordsize-sized ones.


Darn, that makes things difficult doesn't it? :(

Come to think of it, maybe we're approaching things from the wrong
perspective. Databases such as Oracle have their own cache *anyway*.
The reason why we set recordsize is to

1) Match the block size to the write/read access size, so we don't
read/buffer too much.

But if we improved our caching to not cache blocks being accessed by a
database, wouldn't that solve the problem also?

(I'm not precisely sure where and how much performance we win from
setting an optimal recordsize)
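
(As an aside: UFS already exposes this sort of hint through
directio(3C), which asks the filesystem to bypass the page cache for
a given file; ZFS did not act on it at the time of this thread. The
snippet below is only a sketch of what the hint looks like from the
application side, not a claim about ZFS behaviour.)

/*
 * Sketch: how a database could say "don't double-cache this file".
 * directio(3C) is honoured by UFS; on other filesystems the call may
 * fail or be ignored, so treat the error as advisory.
 */
#include <sys/types.h>
#include <sys/fcntl.h>
#include <fcntl.h>
#include <stdio.h>

int
open_db_file(const char *path)
{
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return (-1);

    /* Ask the filesystem to skip the page cache for this file. */
    if (directio(fd, DIRECTIO_ON) != 0)
        perror("directio hint not supported");

    return (fd);
}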

Thanks for listening folks! :)

--
Regards,
Jeremy


Re: [zfs-discuss] Self-tuning recordsize

2006-10-17 Thread Frank Cusack
On October 17, 2006 2:02:19 AM -0700 Erblichs [EMAIL PROTECTED] 
wrote:

Group, et al,

I don't understand: if the problem is systemic, driven by
the number of continually dirty pages and the stress of
cleaning those pages, then why attack it per filesystem?

The problem is FS independent, because any number of
different installed FSs can equally consume pages.
Thus, solving the problem on a per-FS basis seems to me a
band-aid approach.


I'm not very well versed in this stuff, but ISTM you can't guarantee
on-disk consistency unless the problem is dealt with per-FS.

-frank


Re: [zfs-discuss] Self-tuning recordsize

2006-10-16 Thread Roch

Matthew Ahrens writes:
  Jeremy Teo wrote:
   Would it be worthwhile to implement heuristics to auto-tune
   'recordsize', or would that not be worth the effort?
  
  Here is one relatively straightforward way you could implement this.
  
  You can't (currently) change the recordsize once there are multiple 
  blocks in the file.  This shouldn't be too bad because by the time 
  they've written 128k, you should have enough info to make the choice. 
  In fact, that might make a decent algorithm:
  
  * Record the first write size (in the ZPL's znode)
  * If subsequent writes differ from that size, reset write size to zero
  * When a write comes in past 128k, see if the write size is still 
  nonzero; if so, then read in the 128k, decrease the blocksize to the 
  write size, fill in the 128k again, and finally do the new write.
  
  Obviously you will have to test this algorithm and make sure that it 
  actually detects the recordsize on various databases.  They may like to 
  initialize their files with large writes, which would break this.  If 
  you have to change the recordsize once the file is big, you will have to 
  rewrite everything[*], which would be time consuming.
  
  --matt
  

Oracle will typically create its files with 128K writes,
not recordsize-sized ones.

-r



Re: [zfs-discuss] Self-tuning recordsize

2006-10-15 Thread Matthew Ahrens

Jeremy Teo wrote:

Would it be worthwhile to implement heuristics to auto-tune
'recordsize', or would that not be worth the effort?


Here is one relatively straightforward way you could implement this.

You can't (currently) change the recordsize once there are multiple 
blocks in the file.  This shouldn't be too bad because by the time 
they've written 128k, you should have enough info to make the choice. 
In fact, that might make a decent algorithm:


* Record the first write size (in the ZPL's znode)
* If subsequent writes differ from that size, reset write size to zero
* When a write comes in past 128k, see if the write size is still 
nonzero; if so, then read in the 128k, decrease the blocksize to the 
write size, fill in the 128k again, and finally do the new write.


Obviously you will have to test this algorithm and make sure that it 
actually detects the recordsize on various databases.  They may like to 
initialize their files with large writes, which would break this.  If 
you have to change the recordsize once the file is big, you will have to 
rewrite everything[*], which would be time consuming.


--matt

[*] Or if you're willing to hack up the DMU and SPA, you'll just have 
to re-read everything to compute the new checksums and re-write all the 
indirect blocks.
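
To make the bookkeeping concrete, here is a rough sketch of that
heuristic. None of the names below (znode_hint, record_write_pattern,
set_file_blocksize) are real ZFS identifiers; they only illustrate the
first-write-size tracking described above.

/*
 * Sketch of the proposed heuristic, not actual ZFS code.
 * zp_first_write_size is an imagined field on the in-core znode;
 * set_file_blocksize() stands in for whatever DMU call would change
 * the block size of a still-single-block file.
 */
#include <stdint.h>

#define MAX_BLOCKSIZE   (128 * 1024)

struct znode_hint {
    uint64_t zp_first_write_size;  /* size of the first write, 0 = none yet */
    int      zp_size_consistent;   /* nonzero while all writes match */
};

/* Called for every write to the file. */
static void
record_write_pattern(struct znode_hint *zp, uint64_t size)
{
    if (zp->zp_first_write_size == 0) {
        zp->zp_first_write_size = size;   /* remember the first write size */
        zp->zp_size_consistent = 1;
    } else if (size != zp->zp_first_write_size) {
        zp->zp_size_consistent = 0;       /* sizes disagree: give up */
    }
}

/* Called when a write arrives with an offset past 128k. */
static void
maybe_shrink_blocksize(struct znode_hint *zp)
{
    if (zp->zp_size_consistent &&
        zp->zp_first_write_size < MAX_BLOCKSIZE) {
        /*
         * Per the outline above: read in the first 128k, decrease the
         * blocksize to zp_first_write_size, fill the 128k back in, and
         * only then let the new write proceed.
         */
        /* read_first_128k(); set_file_blocksize(zp->zp_first_write_size); */
        /* rewrite_first_128k(); */
    }
}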



Re: [zfs-discuss] Self-tuning recordsize

2006-10-15 Thread Torrey McMahon

Matthew Ahrens wrote:

Jeremy Teo wrote:

Would it be worthwhile to implement heuristics to auto-tune
'recordsize', or would that not be worth the effort?


It would be really great to automatically select the proper recordsize 
for each file!  How do you suggest doing so?



Maybe I've been thinking with my systems hat on too tight, but why not 
have a hook into ZFS where an application, if written to the proper 
spec, can tell ZFS what its desired recordsize is? Then you don't have 
to play any guessing games.
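
For illustration, such a hook might look like the following from the
application side. ZFS_IOC_SET_RECORDSIZE_HINT and the ioctl shape are
entirely made up; they only show what "tell ZFS the desired
recordsize" could mean in practice.

/*
 * Hypothetical interface sketch -- none of these names exist in ZFS.
 * The idea: an application states its natural I/O size for an open
 * file, and the filesystem uses it as that file's recordsize.
 */
#include <sys/ioctl.h>
#include <unistd.h>
#include <stdio.h>

#define ZFS_IOC_SET_RECORDSIZE_HINT  0x5a01   /* made-up request code */

static int
hint_recordsize(int fd, unsigned int recordsize)
{
    /* e.g. a database would pass its block size, such as 8192 */
    if (ioctl(fd, ZFS_IOC_SET_RECORDSIZE_HINT, &recordsize) != 0) {
        perror("recordsize hint not supported");
        return (-1);
    }
    return (0);
}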



Re: [zfs-discuss] Self-tuning recordsize

2006-10-14 Thread Nicolas Williams
On Fri, Oct 13, 2006 at 09:22:53PM -0700, Erblichs wrote:
   For extremely large files (25 to 100GB) that are accessed 
   sequentially for both read & write, I would expect 64k or 128k. 

Large files accessed sequentially don't need any special heuristic for
record size determination: just use the filesystem's record size and be
done.  The bigger the record size, the better -- a form of read ahead.

Nico
-- 


Re: [zfs-discuss] Self-tuning recordsize

2006-10-14 Thread Erblichs
Nico,

Yes, I agree.

But single large random reads and writes would also
benefit from a large record size, so I didn't try to make
that distinction. However, I guess that the best large
random reads & writes would fall within single filesystem
record sizes.

No, I haven't reviewed whether the holes (disk block space)
tend to be multiples of record size, page size, or ..
Would a write of recordsize that didn't fall on a record-size 
boundary write into 2 filesystem blocks / records?

However, would extremely large record sizes, say 1MB (or more)
 (what is the limit?), open up write atomicity issues
or file corruption issues? Would record sizes like these
be equal to multiple track writes?

Also, because of the disk block allocation strategy, I
wasn't too sure that any form of contiguousness across
multiple disk blocks still applied to ZFS with smaller record 
sizes.. Yes, to minimize seek and rotational latencies
and help with read ahead and write behind...

Oh, but once writes have begun to a file, this has in the
past frozen the recordsize. So self-tuning or adjustments
probably NEED to be decided at the creation of the FS
object, OR some type of copy to a new file with a different
record size needs to be done later, when the default or
past record size is found to be significantly wrong. Yes,
I assume that many future reads/writes will occur that
will amortize the copy cost.
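
For what it's worth, the "copy mechanism" amounts to: change the
dataset's recordsize property (zfs set recordsize=...), create a
fresh file so it picks up the new size, and stream the old contents
into it. A minimal sketch, with error handling abbreviated:

/*
 * Rewrite a file so it picks up a new recordsize.  Assumes the
 * dataset's recordsize property was already changed, so the freshly
 * created copy inherits it.  The caller renames newpath over oldpath.
 */
#include <fcntl.h>
#include <unistd.h>

static int
rewrite_with_new_recordsize(const char *oldpath, const char *newpath)
{
    char buf[128 * 1024];           /* 128k copy buffer */
    int in = open(oldpath, O_RDONLY);
    int out = open(newpath, O_WRONLY | O_CREAT | O_EXCL, 0600);
    ssize_t n;

    if (in < 0 || out < 0)
        return (-1);

    while ((n = read(in, buf, sizeof (buf))) > 0)
        if (write(out, buf, n) != n)
            return (-1);            /* short write: give up */

    (void) close(in);
    (void) close(out);
    return (n < 0 ? -1 : 0);
}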

So, yes group... I am still formulating the best
algorithm for this. ZFS applies a lot of knowledge gained
in the past with UFS (page list stuff, chksum stuff,
large file awareness/support), but adds a new twist to 
things..

Mitchell Erblich
--




Nicolas Williams wrote:
 
 On Fri, Oct 13, 2006 at 09:22:53PM -0700, Erblichs wrote:
    For extremely large files (25 to 100GB) that are accessed
    sequentially for both read & write, I would expect 64k or 128k.
 
 Large files accessed sequentially don't need any special heuristic for
 record size determination: just use the filesystem's record size and be
 done.  The bigger the record size, the better -- a form of read ahead.
 
 Nico
 --


[zfs-discuss] Self-tuning recordsize

2006-10-13 Thread Jeremy Teo

Would it be worthwhile to implement heuristics to auto-tune
'recordsize', or would that not be worth the effort?

--
Regards,
Jeremy


Re: [zfs-discuss] Self-tuning recordsize

2006-10-13 Thread Matthew Ahrens

Jeremy Teo wrote:

Would it be worthwhile to implement heuristics to auto-tune
'recordsize', or would that not be worth the effort?


It would be really great to automatically select the proper recordsize 
for each file!  How do you suggest doing so?


--matt



Re: [zfs-discuss] Self-tuning recordsize

2006-10-13 Thread Nicolas Williams
On Fri, Oct 13, 2006 at 08:30:27AM -0700, Matthew Ahrens wrote:
 Jeremy Teo wrote:
 Would it be worthwhile to implement heuristics to auto-tune
 'recordsize', or would that not be worth the effort?
 
 It would be really great to automatically select the proper recordsize 
 for each file!  How do you suggest doing so?

I would suggest the following:

 - on file creation start with record size = 8KB (or some such smallish
   size), but don't record this on-disk yet

 - keep the record size at 8KB until the file exceeds some size, say,
   .5MB, at which point the most common read size, if there were enough
   reads, or the most common write size otherwise, should be used to
   derive the actual file record size (rounding up if need be)

 - if the selected record size != 8KB then re-write the file with the
   new record size

 - record the file's selected record size in an extended attribute

 - on truncation keep the existing file record size

 - on open of non-empty files without associated file record size stick
   to the original approach (growing the file block size up to the FS
   record size, defaulting to 128KB)

I think we should create a namespace for Solaris-specific extended
attributes.

The file record size attribute should be writable, but changes in record
size should only be allowed when the file is empty or when the file data
is in one block.  E.g., writing 8KB to a file's RS EA when the file's
larger than 8KB or consists of more than one block should appear to
succeed, but a subsequent read of the RS EA should show the previous
record size.

This approach might lead to the creation of new tunables for controlling
the heuristic (e.g., which heuristic, initial RS, file size at which RS
will be determined, default RS when none can be determined).
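
To make the heuristic concrete, here is a sketch of the per-file
bookkeeping it implies: a histogram of observed I/O sizes and a
one-shot decision once the file passes 0.5MB. The struct, the
power-of-two buckets, and the "16 reads is enough" threshold are all
illustrative assumptions, not ZFS code.

/*
 * Track how often each power-of-two I/O size is seen, then pick the
 * most common read size (or the most common write size if there were
 * too few reads) once the file grows past ~0.5MB.
 */
#include <stdint.h>

#define RS_MIN_SHIFT   9             /* 512 bytes */
#define RS_MAX_SHIFT   17            /* 128 KB */
#define RS_NBUCKETS    (RS_MAX_SHIFT - RS_MIN_SHIFT + 1)
#define RS_DECIDE_AT   (512 * 1024)  /* decide once the file hits 0.5MB */

struct rs_hint {
    uint32_t reads[RS_NBUCKETS];     /* histogram of read sizes */
    uint32_t writes[RS_NBUCKETS];    /* histogram of write sizes */
    uint32_t nreads;
};

static int
size_to_bucket(uint64_t size)
{
    int shift = RS_MIN_SHIFT;

    while (shift < RS_MAX_SHIFT && ((uint64_t)1 << shift) < size)
        shift++;                     /* round the size up */
    return (shift - RS_MIN_SHIFT);
}

/* Call on every read or write while the file is below RS_DECIDE_AT. */
static void
rs_note_io(struct rs_hint *h, uint64_t size, int is_read)
{
    if (is_read) {
        h->reads[size_to_bucket(size)]++;
        h->nreads++;
    } else {
        h->writes[size_to_bucket(size)]++;
    }
}

/* Call when the file first exceeds RS_DECIDE_AT; returns bytes. */
static uint64_t
pick_record_size(const struct rs_hint *h)
{
    const uint32_t *hist = (h->nreads >= 16) ? h->reads : h->writes;
    int best = RS_NBUCKETS - 1;      /* default to 128KB if no data */
    int i;

    for (i = 0; i < RS_NBUCKETS; i++)
        if (hist[i] > hist[best])
            best = i;
    return ((uint64_t)1 << (best + RS_MIN_SHIFT));
}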

Nico
-- 


Re: [zfs-discuss] Self-tuning recordsize

2006-10-13 Thread Erblichs
Group,

I am not sure I agree with the 8k size.

Since recordsize is based on the size of filesystem blocks
for large files, my first consideration is what will be
the max size of the file object.

For extremely large files (25 to 100GB) that are accessed 
sequentially for both read & write, I would expect 64k or 128k. 

Putpage functions attempt to grab a number of pages off the
vnode and place their modified contents within disk blocks.
Thus if disk blocks are larger, then fewer of them are needed,
which can result in more efficient operations.
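
(Rough arithmetic to illustrate the scale: a 100GB file holds about
819,200 blocks at a 128k recordsize, but about 13,107,200 blocks at
8k -- sixteen times as many blocks to allocate, checksum, and point
at from indirect blocks.)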

However, any small change to the filesystem block results
in the entire filesystem block being accessed, so small accesses
to the block are very inefficient.

Lastly, access to a larger block will occupy the media
for a longer continuous period, possibly creating more
latency than necessary for another, unrelated op.

Hope this helps...

Mitchell Erblich
---


Nicolas Williams wrote:
 
 On Fri, Oct 13, 2006 at 08:30:27AM -0700, Matthew Ahrens wrote:
  Jeremy Teo wrote:
  Would it be worthwhile to implement heuristics to auto-tune
  'recordsize', or would that not be worth the effort?
 
  It would be really great to automatically select the proper recordsize
  for each file!  How do you suggest doing so?
 
 I would suggest the following:
 
  - on file creation start with record size = 8KB (or some such smallish
size), but don't record this on-disk yet
 
  - keep the record size at 8KB until the file exceeds some size, say,
.5MB, at which point the most common read size, if there were enough
reads, or the most common write size otherwise, should be used to
derive the actual file record size (rounding up if need be)
 
 - if the selected record size != 8KB then re-write the file with the
   new record size
 
 - record the file's selected record size in an extended attribute
 
  - on truncation keep the existing file record size
 
  - on open of non-empty files without associated file record size stick
to the original approach (growing the file block size up to the FS
record size, defaulting to 128KB)
 
 I think we should create a namespace for Solaris-specific extended
 attributes.
 
 The file record size attribute should be writable, but changes in record
 size should only be allowed when the file is empty or when the file data
 is in one block.  E.g., writing 8KB to a file's RS EA when the file's
 larger than 8KB or consists of more than one block should appear to
 succeed, but a subsequent read of the RS EA should show the previous
 record size.
 
 This approach might lead to the creation of new tunables for controlling
 the heuristic (e.g., which heuristic, initial RS, file size at which RS
 will be determined, default RS when none can be determined).
 
 Nico
 --