Re: [zfs-discuss] Self-tuning recordsize
Group, et al,

I don't understand why, if the problem is systemic (caused by the number of continually dirty pages and the stress of cleaning those pages), it should be solved per filesystem. The problem is FS-independent, because any number of different installed FSs can equally consume pages; solving it on a per-FS basis seems to me a band-aid approach.

Why doesn't the OS determine that a dangerously high watermark of pages is continually being paged out (we have swapped, and a large percentage of available pages is always dirty, based on recent history) and then:

* force the writes to a set of predetermined pages (limit the number of pages used for I/O),
* schedule I/O for these pages immediately, rather than waiting until the pages are needed and found dirty (hopefully a percentage of these pages will be cleaned and immediately available if needed in the near future).

Yes, the OS could issue the I/O directly, bypassing the page cache, but the assumption is that these processes behave as multiple readers and will need the cached page data in the near future. Dropping pages from the cache just because they CAN totally consume the cache defeats the multiple readers' reason for caching the data in the first place. Thus, the OS should also:

* guarantee that heartbeats stay regular by preserving 5 to 20% of pages for exec / text,
* limit the number of interrupts generated by the network so low-level SCSI interrupts can page and are not starved (something the white paper did not mention); yes, this will cause the loss of UDP-based data, but we need to generate some form of backpressure / explicit congestion event,
* if the files coming in from the network were TCP-based, hopefully a segment would be dropped and act as backpressure to the originator of the data,
* if the files are being read from the FS, determine a maximum I/O rate based on the number of pages that are clean and ready to accept FS data,
* etc.

Thus, tuning that decides whether the page cache should be used for writes or reads should keep one set of processes from adversely affecting the operation of other processes. Any OS should slow down the dirty I/O pages only for the offending processes, leaving other processes' work unaware of the I/O issues.

Mitchell Erblich
-

Richard Elling - PAE wrote:
> Roch wrote:
> > Oracle will typically create its files with 128K writes not recordsize ones.
>
> Blast from the past...
> http://www.sun.com/blueprints/0400/ram-vxfs.pdf
> -- richard

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
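The high-watermark throttle being proposed can be sketched as a toy model. This is not Solaris VM code; the `PageCache` class, the 75% watermark, and the `io_limit` cap are all invented for illustration:

```python
# Toy model of the high-watermark idea: when the fraction of dirty
# pages crosses a threshold, cap how many new pages a writer may
# dirty and schedule the existing dirty set for immediate writeback.

class PageCache:
    def __init__(self, total_pages, hi_watermark=0.75, io_limit=64):
        self.total = total_pages
        self.dirty = 0
        self.hi = hi_watermark      # dirty fraction that triggers throttling
        self.io_limit = io_limit    # max pages a writer may dirty while throttled
        self.scheduled = 0          # pages handed to the I/O scheduler

    def throttled(self):
        return self.dirty / self.total >= self.hi

    def write(self, npages):
        """Dirty npages, respecting the throttle; returns pages actually dirtied."""
        if self.throttled():
            npages = min(npages, self.io_limit)
            # Schedule the whole dirty set now instead of waiting until
            # a clean page is needed and found dirty.
            self.scheduled = self.dirty
        self.dirty += npages
        return npages

cache = PageCache(total_pages=1000)
print(cache.write(800))   # below watermark: all 800 pages dirtied
print(cache.write(500))   # above watermark: capped at io_limit (64)
```

The key point of the model is that only the writer that pushed the cache over the watermark is slowed; readers of already-cached pages are untouched.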
Re: [zfs-discuss] Self-tuning recordsize
Heya Roch,

On 10/17/06, Roch [EMAIL PROTECTED] wrote:
> -snip-
> Oracle will typically create its files with 128K writes not recordsize ones.

Darn, that makes things difficult doesn't it? :(

Come to think of it, maybe we're approaching things from the wrong perspective. Databases such as Oracle have their own cache *anyway*. The reason we set recordsize is to match the block size to the write/read access size, so we don't read or buffer too much. But if we improved our caching to not cache blocks being accessed by a database, wouldn't that solve the problem also? (I'm not precisely sure where, and how much, performance we win from setting an optimal recordsize.)

Thanks for listening folks! :)
--
Regards,
Jeremy
Re: [zfs-discuss] Self-tuning recordsize
On October 17, 2006 2:02:19 AM -0700 Erblichs [EMAIL PROTECTED] wrote:
> Group, et al,
>
> I don't understand that if the problem is systemic based on the number of continual dirty pages and stress to clean those pages, then why. If the problem is FS independent, because any number of different installed FSs can equally consume pages. Thus, to solve the problem on a per FS basis seems to me a bandaid approach.

I'm not very well versed in this stuff, but ISTM you can't guarantee on-disk consistency unless the problem is dealt with per-FS.

-frank
Re: [zfs-discuss] Self-tuning recordsize
Matthew Ahrens writes:
> Jeremy Teo wrote:
> > Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort?
>
> -snip-
>
> Obviously you will have to test this algorithm and make sure that it actually detects the recordsize on various databases. They may like to initialize their files with large writes, which would break this. If you have to change the recordsize once the file is big, you will have to rewrite everything[*], which would be time consuming.
>
> --matt

Oracle will typically create its files with 128K writes, not recordsize ones.

-r
Re: [zfs-discuss] Self-tuning recordsize
Jeremy Teo wrote:
> Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort?

Here is one relatively straightforward way you could implement this.

You can't (currently) change the recordsize once there are multiple blocks in the file. This shouldn't be too bad, because by the time they've written 128k you should have enough info to make the choice. In fact, that might make a decent algorithm:

* Record the first write size (in the ZPL's znode).
* If subsequent writes differ from that size, reset the write size to zero.
* When a write comes in past 128k, see if the recorded write size is still nonzero; if so, read in the 128k, decrease the blocksize to the write size, fill in the 128k again, and finally do the new write.

Obviously you will have to test this algorithm and make sure that it actually detects the recordsize on various databases. They may like to initialize their files with large writes, which would break this. If you have to change the recordsize once the file is big, you will have to rewrite everything[*], which would be time consuming.

--matt

[*] Or if you're willing to hack up the DMU and SPA, you'll just have to re-read everything to compute the new checksums and re-write all the indirect blocks.
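The three bullets above can be sketched in a few lines. This is a Python toy model of the proposed heuristic, not ZPL code; the 128k constant and the "first write size in the znode" idea come from the description, the class and names are invented:

```python
# Sketch: remember the first write size, reset it to zero if a later
# write differs, and when the file grows past 128k use the surviving
# size (if any) as the file's recordsize.

MAX_BLOCK = 128 * 1024  # current ZFS maximum recordsize

class File:
    def __init__(self):
        self.first_write_size = None   # would live in the ZPL's znode
        self.recordsize = MAX_BLOCK
        self.length = 0

    def write(self, size):
        if self.length + size > MAX_BLOCK and self.first_write_size:
            # Past 128k with a consistent write size: shrink the
            # blocksize (a real implementation would re-read and
            # refill the first 128k here).
            self.recordsize = self.first_write_size
        elif self.first_write_size is None:
            self.first_write_size = size          # first write seen
        elif size != self.first_write_size:
            self.first_write_size = 0             # sizes differ: give up
        self.length += size

db = File()
for _ in range(17):        # 17 x 8k writes = 136k, crossing 128k
    db.write(8 * 1024)
print(db.recordsize)       # prints 8192
```

Note how the failure mode Matt mentions shows up directly: a database that initializes its file with one large 256k write records that as its "first write size" and the heuristic never shrinks the blocksize.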
Re: [zfs-discuss] Self-tuning recordsize
Matthew Ahrens wrote:
> Jeremy Teo wrote:
> > Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort?
>
> It would be really great to automatically select the proper recordsize for each file! How do you suggest doing so?

Maybe I've been thinking with my systems hat on too tight, but why not have a hook into ZFS where an application, if written to the proper spec, can tell ZFS what its desired recordsize is? Then you don't have to play any guessing games.
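No such hook exists in ZFS; as a purely hypothetical sketch of what the negotiation behind such a hint might look like, here is the clamping a filesystem would plausibly apply (the 512-byte floor, 128k ceiling, and power-of-two rounding match ZFS's supported recordsizes, the function itself is invented):

```python
# Hypothetical hint negotiation: the application declares its I/O
# size, and the filesystem clamps it to a supported recordsize.

MIN_RS = 512          # smallest ZFS recordsize
MAX_RS = 128 * 1024   # largest ZFS recordsize (at the time of this thread)

def negotiate_recordsize(app_hint):
    """Clamp an application's hinted I/O size to [MIN_RS, MAX_RS],
    rounded up to the next power of two."""
    rs = max(MIN_RS, min(app_hint, MAX_RS))
    p = MIN_RS
    while p < rs:
        p *= 2
    return p

print(negotiate_recordsize(8192))              # prints 8192
print(negotiate_recordsize(9000))              # prints 16384
print(negotiate_recordsize(10 * 1024 * 1024))  # prints 131072 (capped)
```

The appeal of this approach is exactly what the post says: the application knows its access size with certainty, so no guessing heuristic is needed at all.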
Re: [zfs-discuss] Self-tuning recordsize
On Fri, Oct 13, 2006 at 09:22:53PM -0700, Erblichs wrote: For extremely large files (25 to 100GBs), that are accessed sequentially for both read write, I would expect 64k or 128k. Lager files accessed sequentially don't need any special heuristic for record size determination: just use the filesystem's record size and be done. The bigger the record size, the better -- a form of read ahead. Nico -- ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Self-tuning recordsize
Nico,

Yes, I agree. But single random large reads and writes would also benefit from a large record size, so I didn't try to make that distinction. However, I guess that the best random large reads and writes would fall within single filesystem record sizes.

No, I haven't reviewed whether the holes (disk block space) tend to be multiples of record size, page size, or something else. Would a write of recordsize that didn't fall on a record-size boundary write into two filesystem blocks / records?

Also, would extremely large record sizes, say 1MB or more (what is the limit?), open up write-atomicity or file-corruption issues? Would record sizes like these be equal to multiple-track writes? And because of the disk block allocation strategy, I wasn't sure that any form of multiple-disk-block contiguousness still applies in ZFS with smaller record sizes. Yes, the goal is to minimize seek and rotational latencies and help with read-ahead and write-behind.

Oh, but once writes have begun to the file, in the past this has frozen the recordsize. So self-tuning or adjustment needs to be decided at the creation of the FS object, OR some type of copy mechanism needs to copy to a new file with a different record size at a later time, when the default or past record size was determined to be significantly incorrect. Yes, I assume that many future reads and writes will amortize the copy cost.

So, yes group... I am still formulating the best algorithm for this. ZFS applies a lot of knowledge gained from UFS (page list handling, checksums, large-file awareness/support), but adds a new twist to things.

Mitchell Erblich
--

Nicolas Williams wrote:
> On Fri, Oct 13, 2006 at 09:22:53PM -0700, Erblichs wrote:
> > For extremely large files (25 to 100GBs), that are accessed sequentially for both read & write, I would expect 64k or 128k.
>
> Large files accessed sequentially don't need any special heuristic for record size determination: just use the filesystem's record size and be done. The bigger the record size, the better -- a form of read ahead.
>
> Nico
> --
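The boundary question raised above is just arithmetic: a write of recordsize bytes that does not start on a record boundary necessarily spans two records. A small sketch, assuming fixed-size records:

```python
# Which record indices does a write of `length` bytes at `offset` touch?

def records_touched(offset, length, recordsize):
    """Indices of the filesystem records covered by [offset, offset+length)."""
    first = offset // recordsize
    last = (offset + length - 1) // recordsize
    return list(range(first, last + 1))

RS = 128 * 1024
print(records_touched(0, RS, RS))      # aligned: one record, prints [0]
print(records_touched(4096, RS, RS))   # unaligned: two records, prints [0, 1]
```

So the answer to the question is yes: any recordsize-length write at a non-record-aligned offset hits two records, turning one full-record write into two partial (read-modify-write) ones.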
[zfs-discuss] Self-tuning recordsize
Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort?
--
Regards,
Jeremy
Re: [zfs-discuss] Self-tuning recordsize
Jeremy Teo wrote:
> Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort?

It would be really great to automatically select the proper recordsize for each file! How do you suggest doing so?

--matt
Re: [zfs-discuss] Self-tuning recordsize
On Fri, Oct 13, 2006 at 08:30:27AM -0700, Matthew Ahrens wrote:
> Jeremy Teo wrote:
> > Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort?
>
> It would be really great to automatically select the proper recordsize for each file! How do you suggest doing so?

I would suggest the following:

- on file creation start with record size = 8KB (or some such smallish size), but don't record this on-disk yet
- keep the record size at 8KB until the file exceeds some size, say, .5MB, at which point the most common read size, if there were enough reads, or the most common write size otherwise, should be used to derive the actual file record size (rounding up if need be)
- if the selected record size != 8KB then re-write the file with the new record size
- record the file's selected record size in an extended attribute
- on truncation keep the existing file record size
- on open of non-empty files without an associated file record size stick to the original approach (growing the file block size up to the FS record size, defaulting to 128KB)

I think we should create a namespace for Solaris-specific extended attributes. The file record size attribute should be writable, but changes in record size should only be allowed when the file is empty or when the file data is in one block. E.g., writing 8KB to a file's RS EA when the file is larger than 8KB or consists of more than one block should appear to succeed, but a subsequent read of the RS EA should show the previous record size.

This approach might lead to the creation of new tunables for controlling the heuristic (e.g., which heuristic, initial RS, file size at which RS will be determined, default RS when none can be determined).

Nico
--
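The first two bullets of the proposal above can be sketched as follows. The 8KB start, the 0.5MB decision point, and the "most common read size, else most common write size" rule come from the proposal; the `Counter`-based tally and power-of-two rounding are invented details:

```python
# Sketch of the proposed heuristic: start at 8KB, and once 0.5MB has
# been written, derive the real recordsize from observed access sizes.

from collections import Counter

INITIAL_RS = 8 * 1024
DECISION_SIZE = 512 * 1024   # ~0.5MB: decide the real recordsize here

def round_up_pow2(n):
    p = 512
    while p < n:
        p *= 2
    return p

class FileRS:
    def __init__(self):
        self.recordsize = INITIAL_RS   # provisional, not yet on disk
        self.length = 0
        self.reads = Counter()
        self.writes = Counter()
        self.decided = False

    def read(self, size):
        self.reads[size] += 1

    def write(self, size):
        self.writes[size] += 1
        self.length += size
        if not self.decided and self.length > DECISION_SIZE:
            # Prefer the most common read size; fall back to writes.
            sizes = self.reads or self.writes
            self.recordsize = round_up_pow2(sizes.most_common(1)[0][0])
            self.decided = True
            # A real implementation would now re-write the file and
            # store the chosen RS in an extended attribute.

f = FileRS()
for _ in range(3):
    f.read(2048)
for _ in range(40):
    f.write(16 * 1024)   # 640k written, past the decision point
print(f.recordsize)      # prints 2048 (most common read size)
```

Preferring read sizes over write sizes matches the proposal's intent: a file that is bulk-loaded with large writes but then read in small chunks should end up with the small recordsize.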
Re: [zfs-discuss] Self-tuning recordsize
Group,

I am not sure I agree with the 8k size. Since recordsize determines the size of filesystem blocks for large files, my first consideration is the maximum size the file object will reach. For extremely large files (25 to 100GBs) that are accessed sequentially for both read & write, I would expect 64k or 128k.

Putpage functions attempt to grab a number of pages off the vnode and place their modified contents within disk blocks. Thus if disk blocks are larger, fewer of them are needed, which can make the operations more efficient. However, any small change to a filesystem block results in the entire filesystem block being accessed, so small accesses to the block are very inefficient. Lastly, access to a larger block will occupy the media for a longer period of continuous time, possibly creating a larger latency than necessary for another unrelated op.

Hope this helps...

Mitchell Erblich
---

Nicolas Williams wrote:
> On Fri, Oct 13, 2006 at 08:30:27AM -0700, Matthew Ahrens wrote:
> > Jeremy Teo wrote:
> > > Would it be worthwhile to implement heuristics to auto-tune 'recordsize', or would that not be worth the effort?
> >
> > It would be really great to automatically select the proper recordsize for each file! How do you suggest doing so?
>
> I would suggest the following: on file creation start with record size = 8KB (or some such smallish size), but don't record this on-disk yet...
>
> -snip-
>
> Nico
> --
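The small-write inefficiency described above can be put in numbers: modifying a few bytes inside a record forces the whole record to be read and re-written. A toy calculation (the read-whole-record-then-write-whole-record model is a simplification; caching and aggregation would reduce this in practice):

```python
# Bytes moved for one overwrite of `write_size` bytes inside an
# existing record of `recordsize` bytes.

def rmw_io_bytes(write_size, recordsize):
    if write_size >= recordsize:
        return write_size            # full-record writes need no read
    return recordsize + recordsize   # read-modify-write of the record

# A 2k database update against two recordsize choices:
print(rmw_io_bytes(2048, 8 * 1024))     # prints 16384
print(rmw_io_bytes(2048, 128 * 1024))   # prints 262144
```

With 128k records, a 2k update moves 256k of data, a 128x amplification, which is exactly why the small-access case argues against simply defaulting to the largest recordsize.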