Re: [zfs-discuss] overhead of snapshot operations

2008-03-21 Thread Bill Moloney
you can find the ZFS on-disk spec at:

http://opensolaris.org/os/community/zfs/docs/ondiskformat0822.pdf

I don't know of any way to produce snapshots at periodic intervals
other than shell scripts (or a cron job), but creating and destroying
snapshots from the command line is essentially instantaneous.
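For example, a minimal sketch of a cron-driven hourly snapshot with simple
pruning (the pool/file-system name, retention count and script path are all
hypothetical):

    #!/bin/sh
    # snap_hourly.sh -- take a time-stamped snapshot and prune old ones
    # (tank/home and the retention count of 24 are just examples)
    FS=tank/home
    KEEP=24
    STAMP=`date '+%Y%m%d-%H%M'`

    /usr/sbin/zfs snapshot ${FS}@auto-${STAMP}

    # destroy all but the $KEEP newest auto- snapshots; a lexicographic
    # sort of this time stamp format is also chronological
    COUNT=`/usr/sbin/zfs list -H -t snapshot -o name | grep -c "^${FS}@auto-"`
    if [ ${COUNT} -gt ${KEEP} ]; then
        /usr/sbin/zfs list -H -t snapshot -o name | grep "^${FS}@auto-" | \
            sort | head -`expr ${COUNT} - ${KEEP}` | \
            while read SNAP; do
                /usr/sbin/zfs destroy "${SNAP}"
            done
    fi

and a crontab entry to run it at the top of every hour:

    0 * * * * /usr/local/bin/snap_hourly.sh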

If you have a need for continuous snapshots (checkpoints) you may
want to check out the NILFS system (Linux, open source) available from
NTT Japan at:

http://www.nilfs.org/en/

NILFS does continuous checkpoints (on all write events), is log based
and allows configuration of a time-based window within which to keep
active checkpoints ... after this amount of time, old checkpoints are discarded
and their space is reclaimed

regards, Bill
 
 


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-19 Thread Bill Moloney
Hi Bob ... as Richard has mentioned, allocation to vdevs
is done in fixed-size chunks (Richard says 1MB; I remember a 512KB
figure from the original spec, but the exact value isn't very
important), and the allocation algorithm is basically doing load
balancing across the vdevs.

For your non-raid pool, this chunk size will stay fixed regardless
of the block size you choose when creating the file system or the
IO unit size your application(s) use.  (The stripe size can
change dynamically in a raidz pool, but not in your non-raid pool.)

Measuring bandwidth for your application load is tricky with ZFS,
since there are many hidden IO operations (besides the ones that
your application is requesting) that ZFS must perform.  If you collect
iostat figures on bytes transferred to the hard drives and compare those
numbers to the amount of data your application(s) transferred, you can
find potentially large differences.  The differences are largely driven
by the IO size your application(s) use.  For example, when I run the
following tests, here are my observations:
-using a dual-Xeon server with a QLogic 2Gb FC interface
-using a pool with five 10K RPM 146GB FC drives
-sequentially writing four previously written 15GB files in one
 file system in the pool (this file system is using a 128KB
 block size), with a separate thread writing each file
 concurrently, for a total of 60GB written

block size written   actual written   disk IO observed   BW (MB/s)   %CPU
      4KB                60GB             227.3GB           34.2      20.4
     32KB                60GB             216.5GB           36.1      13.9
    128KB                60GB              63.6GB           69.6      31.0

You can see that a small application IO size causes a great deal of
additional meta-data based IO (more than 3 times the actual application
IO requirements), while the 128KB application writes induce
only marginally more disk IO than the application actually uses.

the BW numbers here are for just the application data, but when
you consider all the IO from the disks over the test times, the 
physical BW is obviously greater in all cases.
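For anyone who wants to reproduce this kind of comparison, here is a rough
sketch of the harness (my tests used a small threaded writer; the dd loop
below is only an approximation, and the pool, path and file names are
placeholders -- the four 15GB files are assumed to already exist):

    #!/bin/sh
    # rewrite four existing 15GB files concurrently with a chosen write size,
    # while iostat records what the back-end disks actually see
    BS=4k  COUNT=3932160        # 4KB x 3932160 = 15GB per file
    # BS=32k  COUNT=491520      # 32KB writes
    # BS=128k COUNT=122880      # 128KB writes
    DIR=/tank/fs1

    iostat -xn 5 > /var/tmp/iostat.${BS}.out &
    IOSTAT_PID=$!

    PIDS=""
    for i in 1 2 3 4; do
        # conv=notrunc keeps dd from truncating the existing file
        dd if=/dev/zero of=${DIR}/file${i} bs=${BS} count=${COUNT} conv=notrunc &
        PIDS="${PIDS} $!"
    done
    wait ${PIDS}

    kill ${IOSTAT_PID}
    # sum the kr/s and kw/s columns in the iostat output over the run and
    # compare the result with the 60GB the application actually wrote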

All my drives were uniformly busy in these tests, but the 
small application IO sizes forced much more total IO against
the drives.  In your case the application IO rate would be even
further degraded due to the mirror configuration.  The extra
load of reading and writing meta-data (including ditto-blocks) 
and mirror devices conspire to reduce the application IO rate, 
even though the disk device IO rates may be quite good.  

File system block size reduction only exacerbates the problem by
requiring more meta-data to support the same quantity of
application data, and for sequential IO this is a loser.  In any
case, for a non-raid pool, the allocation chunk size per drive 
(the stripe size) is not influenced by file system block size.

When application IO sizes get small, the overhead in ZFS goes
up dramatically.

regards, Bill

 The application is spending almost all the time blocked on I/O.  I see
 that the number of device writes per second seems pretty high.  The
 application is doing I/O in 128K blocks.  How many IOPS does a modern
 300GB 15K RPM SAS drive typically deliver?  Of course the IOPS capacity
 depends on if the access is random or sequential.  At the application
 level, the access is completely sequential but ZFS is likely doing some
 extra seeks.
 
 


Re: [zfs-discuss] ZFS I/O algorithms

2008-03-19 Thread Bill Moloney
 On my own system, when a new file is written, the write block size
 does not make a significant difference to the write speed

Yes, I've observed the same result ... when a new file is being written 
sequentially, the file data and newly constructed meta-data can be 
built in cache and written in large sequential chunks periodically,
without the need to read in existing meta-data and/or data.  It
seems that data and meta-data that are newly constructed in cache for
sequential operations will persist in cache effectively, 
and the application IO size is a much less sensitive parameter.
Monitoring disks with iostat in these cases shows the disk IO to 
be only marginally greater than the application IO.
This is why I specified that the write tests
described in my previous post were to existing files.
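A quick way to see the difference (the path and sizes below are placeholders)
is to write a brand new file and then rewrite that same file with the same
small writes, while watching the pool's back-end IO in another window:

    # per-vdev view of the pool's back-end IO, sampled every 5 seconds
    zpool iostat -v tank 5

    # pass 1: a brand new 10GB file written with 4KB application writes
    dd if=/dev/zero of=/tank/fs1/seqtest bs=4k count=2621440

    # pass 2: rewrite the same (now existing) file with the same 4KB writes;
    # conv=notrunc stops dd from truncating it back into a new file
    dd if=/dev/zero of=/tank/fs1/seqtest bs=4k count=2621440 conv=notrunc

The first pass corresponds to the new-file case, the second to the
existing-file case described above.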

The overhead of doing small sequential writes to an
existing object is so much greater than writing to a new
object that it begs for some reasonable explanation.
The only one I've been able to assemble from my
experiments is that data/meta-data for existing objects
is not retained effectively in cache if ZFS detects that such an
object is being sequentially written.  This forces
constant re-reading of the data/meta-data associated with
such an object, causing a huge increase in device IO
traffic that does not seem to accompany the writing of a
brand new object.  The size of RAM seems to make little
difference in this case.

As small sequential writes accumulate in the 5 second cache, the
chain of meta-data leading to the newly constructed data block may
see only one pointer (of the 128 in the final set) changing to point to
that data block, but all the meta-data from the
uberblock to the target must be rewritten on the 5 second flush.
Of course this is not much different from what's happening in the
newly created object scenario, so it must be the behavior that follows
this flush that's different.  It seems to me that after this flush, some
or all of the data/meta-data that will be affected next is re-read, even
though much of what's needed for subsequent operations should already
be in cache.

My experience with large RAM systems and with the use of SSDs 
as ZFS cache devices has convinced me that data/meta-data associated
with sequential write operations to existing objects (and ZFS seems 
very good at detecting this association) does not get retained 
in cache very effectively.

You can see this very clearly if you look at the IO to a cache
device (ZFS allows you to easily attach a device to a pool as a
cache device, which acts as a sort of L2 cache behind RAM).
When I do random IO operations to existing objects I
see a large amount of IO to my cache device as RAM fills and ZFS
pushes cached information (that would otherwise be evicted)
to the SSD cache device.  If I repeat
the random IO test over the same total file space I see improved
performance as I get occasional hits from the RAM cache and the
SSD cache.  As this extended cache hierarchy warms up with each
test run, my results continue to improve.  If I run sequential write
operations to existing objects, however, I see very little activity to
my SSD cache, and virtually no change in performance when I
immediately run the same test again.
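If you want to watch this yourself, the per-vdev statistics make the cache
device's activity (or lack of it) easy to see; for example (pool and device
names are hypothetical):

    # attach an SSD to the pool as a cache device
    zpool add tank cache c4t0d0

    # per-vdev view: the cache device gets its own line, with its own
    # capacity (alloc/free) and read/write rates, sampled every 5 seconds
    zpool iostat -v tank 5

During random-read runs the cache device's line stays busy and its allocated
space keeps growing; during the sequential-rewrite runs described above it
stays essentially idle.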

It seems that ZFS is still in need of some fine-tuning for small
sequential write operations to existing objects.

regards, Bill
 
 


[zfs-discuss] SSD cache device hangs ZFS

2008-01-17 Thread Bill Moloney
I'm using a FC flash drive as a cache device to one of my pools:
  zpool  add  pool-name  cache  device-name
and I'm running random IO tests to assess performance on a 
snv-78 x86 system

I have a set of threads each doing random reads to about 25% of
its own, previously written, large file ... a test run will read in 
about 20GB on a server with 2GB of RAM

using   zpool iostat,  I can see that the SSD device is being used
aggressively, and each time I run my random read test I find
better performance than the previous execution ... I also see my
SSD drive filling up more and more between runs

this behavior is what I expect, and the performance improvements
I see are quite good (4X improvement over 5 runs), but I'm getting
hung from time to time

after several successful runs of my test application, some run of
my test will be going along fine, but at some point before it finishes
I see that all IO to the pool has stopped, and, while I can still use
the system for other things, most operations that involve the pool
will also hang (e.g. a   wc   on a pool-based file will hang)

all of these hung processes seem to sleep in the kernel
at an uninterruptible level, and will not die on a   kill -9   attempt
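if anyone wants to see where those threads are parked, the kernel stacks can
still be captured while the pool is hung; something along these lines (the
pid is a placeholder, and the output should go to a file that is not on the
hung pool):

    # dump the stacks of all kernel threads from the live kernel
    echo "::threadlist -v" | mdb -k > /var/tmp/threadlist.out

    # userland stack of one of the hung test processes
    pstack <pid-of-hung-process> > /var/tmp/pstack.out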

any attempt to shut down will hang, and the only way I can recover
is to use the   reboot -qnd   command (I think the -n option
is the key, since it keeps the system from trying to sync before
the reboot)

when I reboot, everything is fine again and I can continue testing
until I run into this problem again ... does anyone have any thoughts
on this issue ? ... thanks, Bill
 
 


Re: [zfs-discuss] SSD cache device hangs ZFS

2008-01-17 Thread Bill Moloney
Thanks Marion and Richard,
but I've run these tests with much larger data sets
and have never had this kind of problem when no
cache device was involved

In fact, if I remove the SSD cache device from my
pool and run the tests, they seem to run with no issues
(except for some reduced performance as I would expect)

the same SSD disk works perfectly as a separate ZIL device,
providing improved IO with synchronous writes on large test
runs of  100GBs

... Bill
 
 


Re: [zfs-discuss] removing a separate zil device

2008-01-09 Thread Bill Moloney
Thanks to Kyle, Richard and Eric

In dealing with this problem, I realize now that I could have
saved myself a lot of grief if I had simply used the   replace
command to substitute some other drive for my flash drive before
I removed it
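in other words, while the pool was still healthy I should have done something
like this (device names are hypothetical):

    # substitute an ordinary pool disk for the separate ZIL (log) device
    # while the pool is imported and the flash drive is still present
    zpool replace tank c5t0d0 c2t3d0    # c5t0d0 = flash log, c2t3d0 = spare

    # wait for the replacement to complete before pulling the flash drive
    zpool status tank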

I think that this point is critical for anyone who finds themselves
experimenting with separate ZILs ... since a pool will continue to
function with no obvious problems after a separate ZIL is removed,
it's easy to think that, while the benefit of a separate ZIL is gone, the
pool drives have picked up the ZIL function and all is well with the world

the sad reality comes when a reboot or export of the pool occurs
and there is then no way to re-import the pool without re-inserting
the missing ZIL device, and if the missing ZIL device is no longer 
available, the pool is inaccessible ... it's too late now to  do a replace,
because the pool must be imported to do anything ... all the data in
your pool is perfect, but it's perfectly out of reach

... Bill
 
 


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Bill Moloney
 I have a question that is related to this topic: Why
 is there only a (tunable) 5 second threshold and not
 also an additional threshold for the buffer size
 (e.g. 50MB)?
 
 Sometimes I see my system writing huge amounts of
 data to a zfs, but the disks staying idle for 5
 seconds, although the memory consumption is already
 quite big and it really would make sense (from my
 uneducated point of view as an observer) to start
 writing all the data to disks. I think this leads to
 the pumping effect that has been previously mentioned
 in one of the forums here.
 
 Can anybody comment on this?
 
 TIA,
 Thomas

because ZFS always writes to a new location on the disk, premature writing
can often result in redundant work ... a single host write to a ZFS object
results in the need to rewrite all of the changed data and meta-data leading
to that object

if a subsequent follow-up write to the same object occurs quickly,
this entire path, once again, has to be recreated, even though only a small 
portion of it is actually different from the previous version

if both versions were written to disk, the result would be to physically write 
potentially large amounts of nearly duplicate information over and over
again, resulting in logically vacant bandwidth

consolidating these writes in host cache eliminates some redundant disk
writing, resulting in more productive bandwidth ... providing some ability to
tune the consolidation time window and/or the accumulated cache size may
seem like a reasonable thing to do, but I think that it's typically a moving
target, and depending on an adaptive, built-in algorithm to dynamically set
these marks (as ZFS claims it does) seems like a better choice
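for what it's worth, the 5 second trigger appears to be just a kernel
variable, so it can at least be inspected from mdb ... the variable name
below is an assumption on my part (it has changed across builds), so check
the ZFS source for your build before relying on it, and certainly before
changing it:

    # read the current transaction group interval (seconds) from the live
    # kernel; txg_time is the name I'm assuming here
    echo "txg_time/D" | mdb -k

    # bump it to 10 seconds on a scratch box (not something to do casually)
    echo "txg_time/W 0t10" | mdb -kw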

...Bill
 
 


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-08 Thread Bill Moloney
 But it seems that when we're talking about full block writes (such as
 sequential file writes) ZFS could do a bit better.

 And as long as there is bandwidth left to the disk and the controllers,
 it is difficult to argue that the work is redundant.  If it's free in
 that sense, it doesn't matter whether it is redundant.  But if it turns
 out NOT to have been redundant you save a lot.
 

I think this is why an adaptive algorithm makes sense ... in situations where
an application issues frequent, progressive small writes, the amount
of redundant disk access can be significant, and longer consolidation times
may make sense ... larger writes (>= the FS block size) would benefit less
from longer consolidation times, and shorter thresholds could provide more
usable bandwidth

to get a sense of the issue here, I've done some write testing to previously
written files in a ZFS file system, and the choice of write element size
shows some big swings in actual vs data-driven bandwidth

when I launch a set of threads, each of which writes 4KB buffers
sequentially to its own file, I observe that for 60GB of application
writes the disks see 230+GB of IO (reads and writes):
data-driven BW  ~41 MB/s    (my 60GB in ~1500 sec)
actual BW      ~157 MB/s    (the 230+GB in ~1500 sec)

if I do the same writes with 128KB buffers (the block size of my pool),
the same 60GB of writes only generates 95GB of disk IO (reads and writes):
data-driven BW  ~85 MB/s    (my 60GB in ~700 sec)
actual BW     ~134.6 MB/s   (the 95+GB in ~700 sec)

in the first case, longer consolidation times would have led to less total IO
and better data-driven BW, while in the second case shorter consolidation
times would have worked better

as far as redundant writes possibly occupying free bandwidth (and thus
costing nothing), I think you also have to consider the related costs of
additional block scavenging, and less available free space at any specific 
instant, possibly limiting the sequentiality of the next write ... of
course there's also the additional device stress

in any case, I agree with you that ZFS could do a better job in this area,
but it's not as simple as just looking for large or small IOs ...
sequential vs random access patterns also play a big role (as you point out)

I expect  (hope) the adaptive algorithms will mature over time, eventually
providing better behavior over a broader set of operating conditions
... Bill
 
 


[zfs-discuss] removing a separate zil device

2008-01-07 Thread Bill Moloney
This is a re-post of this issue ... I didn't get any replies to the previous
post of 12/27 ... I'm hoping someone is back from holiday
who may have some insight into this problem ... Bill

when I remove a separate zil disk from a pool, the pool continues to function,
logging synchronous writes to the disks in the pool. Status shows that the log
disk has been removed, and everything seems to work fine until I export the
pool. 

After the pool has been exported (long after the log disk was removed
and gigabytes of synchronous writes were performed successfully), 
I am no longer able to
import the pool. I get an error stating that a pool device cannot be found, 
and importing the pool cannot succeed until the missing device (the separate
zil log disk) is replaced in the system. 

There is a bug filed by Neil Perrin:
6574286 removing a slog doesn't work
regarding the problem of not being able to remove a separate zil device from
a pool, but no detail on the ramifications of just taking the device out of
the JBOD. 

Taking it out does not impact the immediate function of the pool,
but the inability to re-import it after this event is a significant issue. Has 
anyone found a workaround for this problem ? I have data in a pool that
I cannot import because the separate zil is no longer available to me.
 
 


Re: [zfs-discuss] Intent logs vs Journaling

2008-01-07 Thread Bill Moloney
file system journals may support a variety of availability models, ranging from
simple support for fast recovery (return to consistency) with possible data 
loss, to those that attempt to support synchronous write semantics with no data 
loss on failure, along with fast recovery

the simpler models use a persistent caching scheme for file system meta-data
that can be used to limit the possible sources of file system corruption,
avoiding a complete fsck run after a failure ... the journal specifies the only
possible sources of corruption, allowing a quick check-and-recover mechanism
... here the journal is always written with meta-data changes (at least), 
before the actual updated meta-data in question is over-written to its old
location on disk ... after a failure, the journal indicates what meta-data 
must be checked for consistency

more elaborate models may cache both data and meta-data, to support 
limited data loss, synchronous writes and fast recovery ... newer file systems
often let you choose among these features

since ZFS never updates any data or meta-data in place (anything written into a
pool is always written to a new, unused location), it does not have the same
consistency issues that traditional file systems have to deal with ... a ZFS
pool is always in a consistent state, moving from an old state to a new state
only after the new state has been completely committed to persistent store ...
the final update to a new state depends on a single atomic write that either
succeeds (moving the system to the consistent new state) or fails, leaving the
system in its current consistent state ... there can be no interim inconsistent
state

a ZFS pool builds its new state information in host memory for some period of
time (about 5 seconds) as host IOs are generated by various applications ...
at the end of this period these buffers are written to fresh locations on
persistent store as described above, meaning that application writes are
treated asynchronously by default, and in the face of a failure, some amount of
information that has been accumulating in host memory can be lost

if an application requires synchronous writes and a guarantee of no data loss,
then ZFS must somehow get the written information to persistent store
before it returns from the application's write call ... this is where the
intent log comes in ... the system call information (including the data)
involved in a synchronous write operation is written to the intent log on
persistent store before the application write call returns ... but the
information is also written into the host memory buffer scheduled for the
5 sec updates (just as if it were an asynchronous write) ... at the end of the
5 sec update time the new host buffers are written to disk, and, once
committed, the intent log information written to the ZIL is no longer needed
and can be jettisoned (so the ZIL never needs to be very large)

if the system fails, the accumulated but not flushed host buffer information
will be lost, but the ZIL records will already be on disk for any synchronous
writes and can be replayed when the host comes back up, or the pool is
imported by some other living host ... the pool, of course, always comes up
in a consistent state, but any ZIL records can be incorporated into a new 
consistent state before the pool is fully imported for use

the ZIL is always there in host memory, even when no synchronous writes
are being done, since the POSIX fsync() call could be made on an open 
write channel at any time, requiring all to-date writes on that channel
to be committed to persistent store before it returns to the application
... it's cheaper to write the ZIL at this point than to force the entire 5 sec
buffer out prematurely

synchronous writes can clearly have a significant negative performance 
impact in ZFS (or any other system) by forcing writes to disk before having a
chance to do more efficient, aggregated writes (the 5 second type), but
the ZIL solution in ZFS provides a good trade-off with a lot of room to
choose among various levels of performance and potential data loss ...
this is especially true with the recent addition of separate ZIL device
specification ... a small, fast (nvram type) device can be designated for
ZIL use, leaving slower spindle disks for the rest of the pool 
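designating one is a single command; for example (pool and device names are
hypothetical) ... note also, as the separate-zil threads elsewhere on this
list point out, that on current builds a log device cannot later be removed
from the pool:

    # dedicate a small, fast (nvram/flash) device to the intent log;
    # synchronous writes then land here instead of on the pool disks
    zpool add tank log c5t0d0

    # the log device shows up in its own "logs" section
    zpool status tank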

hope this helps ... Bill
 
 


[zfs-discuss] separate zil removal

2007-12-27 Thread Bill Moloney
when I remove a separate zil disk from a pool, the pool continues to function,
logging synchronous writes to the disks in the pool.  Status shows that the log
disk has been removed, and everything seems to work fine until I export the
pool.  

After the pool has been exported (long after the log disk was removed
and gigabytes of synchronous writes were performed successfully), 
I am no longer able to
import the pool.  I get an error stating that a pool device cannot be found, 
and importing the pool cannot succeed until the missing device (the separate
zil log disk) is replaced in the system.  

There is a bug filed by Neil Perrin:
6574286 removing a slog doesn't work
regarding the problem of not being able to remove a separate zil device from
a pool, but no detail on the ramifications of just taking the device out of
the JBOD.  

Taking it out does not impact the immediate function of the pool,
but the inability to re-import it after this event is a significant issue.  Has 
anyone found a workaround for this problem ?  I have data in a pool that
I cannot import because the separate zil is no longer available to me.
 
 


[zfs-discuss] snv-76 panics on installation

2007-11-20 Thread Bill Moloney
I have an Intel based server running dual P3 Xeons (Intel A46044-609, 1.26GHz) 
with a BIOS from American Megatrends Inc (AMIBIOS, SCB2 production BIOS rev 
2.0, BIOS build 0039) with 2GB of RAM

when I attempt to install snv-76 the system panics during the initial boot from 
CD

I've been using this system for extensive testing with ZFS and have had no 
problems installing snv-68, 69 or 70, but I'm having this problem with snv-76

any information regarding this problem or a potential workaround would be 
appreciated

Thx ... bill moloney
 
 


[zfs-discuss] nv-69 install panics dell precision 670

2007-08-14 Thread Bill Moloney
I have nv-63 installed on a Dell Precision 670 (dual Intel p4s) using zfs with 
no problems

when I attempt to start installing nv-69 from CD #1, just after the Copyright
notice and the "Use is subject to license terms" line print to the screen (when
device discovery usually begins), my system panics and begins to reboot

the solaris panic message splashes across the screen too fast to read before 
immediately resetting and rebooting the machine

I've been able to install nv-69 successfully on other Intel (server) platforms

any ideas or suggestions or help would be appreciated
 
 


Re: [zfs-discuss] nv-69 install panics dell precision 670

2007-08-14 Thread Bill Moloney
using hyperterm, I captured the panic message as:

SunOS Release 5.11 Version snv_69 32-bit
Copyright 1983-2007 Sun Microsystems, Inc.  All rights reserved.
Use is subject to license terms.

panic[cpu0]/thread=fec1ede0: Can't handle mwait size 0

fec37e70 unix:mach_alloc_mwait+72 (fec2006c)
fec37e8c unix:mach_init+b0 (c0ce80, fe800010, f)
fec37eb8 unix:psm_install+95 (fe84166e, 3, fec37e)
fec37ec8 unix:startup_end+93 (fec37ee4, fe91731e,)
fec37ed0 unix:startup+3a (fe800010, fec33c98,)
fec37ee4 genunix:main+1e ()

skipping system dump - no dump device configured
rebooting...

this behavior loops endlessly
 
 


Re: [zfs-discuss] nv-69 install panics dell precision 670

2007-08-14 Thread Bill Moloney
Thanks, all, for the details on this bug; it looks like nv-70 should work for
me when the drop is available

I've been using an older P3 based server to test the new separate ZIL device 
feature that became available in nv-68, using a FC flash drive as a log device 
outside the zpool itself

I wanted to do some additional testing using the faster Dell 670 system but 
could not get nv-68 or 69 to install ... now I know why ... Bill
 
 


[zfs-discuss] ZVOLs and O_DSYNC, fsync() behavior

2007-06-25 Thread Bill Moloney
I've spent some time searching, and I apologize if I've missed this somewhere,
but in testing ZVOL write performance I cannot see any noticeable difference 
between opening a ZVOL with or without O_DSYNC.  

Does the O_DSYNC flag have any actual influence on ZVOL writes ?

For ZVOLS that I have opened without the O_DSYNC flag, I find that
if I follow each write (these are 4KB writes done to a previously written area 
of a ZVOL) with an fsync() call on the channel, I see a significant performance 
drop (as expected).

But I do not see this behavior when I open the ZVOL with
the O_DSYNC flag (and do not do the fsync() operations) as I thought I should.

While the O_DSYNC flag is accepted without error when opening a ZVOL, it
apparently does not actually make writes to that ZVOL synchronous ... is this
correct, or am I missing something ?
 
 


[zfs-discuss] Re: Re[2]: Re: How does ZFS write data to disks?

2007-05-17 Thread Bill Moloney
this is not a problem we're trying to solve, but part of a characterization
study of the zfs implementation ... we're currently using the default 8KB
blocksize for our zvol deployment, and we're performing tests using write
block sizes as small as 4KB and as large as 1MB as previously described
(including an 8KB write aligned to logical zvol block zero, for a perfect
match to the zvol blocksize) ... in all cases we see at least twice the IO
to the disks that we generate from our test program (and it's much worse for
smaller write block sizes)

we're not exactly caught in read-modify-write hell (except when we write the
4KB blocks that are smaller than the zvol blocksize), it's more like
modify-write hell, since the original meta-data that maps the 2GB region
we're writing is probably just read once and kept in cache for the duration
of the test ... the large amount of back-end IO is almost entirely write
operations, but these write operations include the re-writing of meta-data
that has to change to reflect the re-location of newly written data
(remember, no in-place writes ever occur for data or meta-data)

using the default zvol block size of 8KB, zfs requires, in just block-pointer
meta-data, about 1.5% of the total 2GB write region (a 128-byte block pointer
for every 8KB data block is 128/8192, or about 1.56% ... this is a large
percentage vs other file systems like ufs, for example, because zfs uses a
128 byte block pointer vs a ufs 8 byte block pointer) ... as new data is
written over the old data, the leaves of the meta-data tree are necessarily
changed to point to the new locations on disk of the new data, but any new
leaf block-pointer requires that a new block of leaf pointers be allocated
and written, which requires that the next indirect level up from these leaves
point to this new set of leaf pointers, so it must be rewritten itself, and
so on up the tree (and remember, meta-data is subject to being written in up
to 3 copies - default is 2 - anytime any of it is written to disk)

the indirect pointer blocks closer to the root of the tree may only see a
single pointer change over the course of a 5 second consolidation (based on
the size of the zvol, the size of the block allocation unit in the zvol and
the amount of data actually written to the zvol in 5 seconds), but a complete
new indirect block must be created and written to disk (all the way back to
the uberblock) on each transaction group write ... this means that some of
these meta-data blocks are written to disk over and over again with only
small changes from their previous composition ... consolidating for more than
5 seconds would help to mitigate this situation, but longer consolidation
periods put more data at risk of being lost in case of a power failure

this is not particularly a problem, just a manifestation of the need to never
write in-place, a rather large block pointer size and the possible writing of
multiple copies of meta-data (of course this block pointer carries checksums,
and the addresses of up to 3 duplicate blocks, providing the excellent data
and meta-data protection zfs is so well known for) ... the original thread
that this reply addressed was the characteristic 5 second delay in writes,
which I tried to explain in the context of copy-on-write consolidation, but
it's clear that even this delay cannot prevent the modification and
re-writing of the same basic meta-data many times with small modifications
 
 


[zfs-discuss] Re: How does ZFS write data to disks?

2007-05-16 Thread Bill Moloney
writes to ZFS objects have significant data and meta-data implications, based
on the zfs copy-on-write implementation ... as data is written into a file
object, for example, this update must eventually be written to a new location
on physical disk, and all of the meta-data (from the uberblock down to this
object) must be updated and re-written to a new location as well ... while in
cache, the changes to these objects can be consolidated, but once written out
to disk, any further changes would make this recent write obsolete and
require it all to be written once again to yet another new location on the
disk

batching transactions for 5 seconds (the trigger discussed in the zfs
documentation) is essential to limiting the amount of redundant re-writing
that takes place to physical disk ... keeping a disk busy 100% of the time by
writing mostly the same data over and over makes far less sense than
collecting a group of changes in cache and writing them efficiently every
trigger period

even with this optimization, our experience with small, sequential writes
(4KB or less) to zvols that have been previously written (to ensure the
mapping of real space on the physical disk), for example, shows bandwidth
values that are less than 10% of comparable larger (128KB or larger) writes
... you can see this behavior dramatically if you compare the amount of host
initiated write data (front-end data) to the actual amount of IO performed to
the physical disks (both reads and writes) to handle the host's front-end
request ... for example, doing sequential 1MB writes to a (previously
written) zvol (a simple catenation of 5 FC drives in a JBOD) and writing 2GB
of data induced more than 4GB of IO to the drives (with smaller write sizes
this ratio gets progressively worse)
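a rough sketch of this kind of zvol test (the real runs used a dedicated test
tool; the volume name and sizes here are placeholders):

    # create a 2GB zvol (default 8KB volblocksize) and pre-write the whole
    # thing so that every block is mapped before the measured pass
    zfs create -V 2g tank/testvol
    dd if=/dev/zero of=/dev/zvol/rdsk/tank/testvol bs=1024k count=2048

    # measured pass: rewrite the same 2GB with the write size under test,
    # while iostat records what the back-end disks actually see
    iostat -xn 5 > /var/tmp/iostat.zvol.out &
    IOSTAT_PID=$!
    dd if=/dev/zero of=/dev/zvol/rdsk/tank/testvol bs=1024k count=2048
    kill ${IOSTAT_PID}

comparing the bytes moved by the second dd with the totals in the iostat
output is what produces ratios like the 2GB written vs 4+GB of disk IO
mentioned above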
 
 


[zfs-discuss] Re: The ZFS MOS and how DNODES are stored

2007-02-07 Thread Bill Moloney
Thanks for the input Darren, but I'm still confused about DNODE atomicity ...
it's difficult to imagine that a change made anyplace in the zpool would
require copy operations all the way back up to the uberblock (e.g. if some
single file in one of many file systems in a zpool was suddenly changed,
making a new copy of all of the intervening objects in the tree back to the
uberblock would seem to be an untenable amount of work, even though it may
all be carried out in memory and not involve any IO ... although if the zpool
itself was under snapshot control this would have to happen)

the DNODE implementation appears to include its own checksum field
(self-checksumming), and controlling DNODEs (those that lead to descendent
collections of DNODEs) are always of the known type DMU_OT_DNODE, so their
block pointers do not have to checksum the DNODEs they point to (unlike all
other block pointers, which do checksum the data they point to) ... this
would allow for in-place updates of a DNODE, without the need to continue
further up the tree ... since all objects are controlled by a DNODE, updates
to an object's data can stop at its DNODE if that DNODE is not under some
snapshot or clone control ... if this is not the case, then 'any'
modification in the zpool would require copying up to the uberblock
 
 


[zfs-discuss] The ZFS MOS and how DNODES are stored

2007-02-06 Thread Bill Moloney
ZFS documentation lists snapshot limits on any single file system in a pool at
2**48 snaps, and that seems to logically imply that a snap on a file system
does not require an update to the pool's currently active uberblock.  That is
to say, if we take a snapshot of a file system in a pool, and then make any
changes to that file system, the copy-on-write behavior induced by the changes
will stop at some synchronization point below the uberblock (presumably at or
below the DNODE that is the DSL directory for that file system).  In-place
updates to a DNODE that has been allocated in a single sector-sized ZFS block
can be considered atomic, since the sector write will either succeed or fail
totally, leaving either the old version or the new version, but not a
combination of the two.

This seems sensible to me, but the description of object sets beginning on
page 26 of the ZFS On-Disk Specification states that the DNODE type
DMU_OT_DNODE (the type of the DNODE that's included in the 1KB objset_phys_t
structure) will have, as its data, an array of DNODEs allocated in 128KB
blocks, and the picture (Illustration 12 in the spec) shows these blocks as
containing 1024 DNODEs.  Since DNODEs are 512 bytes, it would not be possible
to fit the 1024 DNODEs depicted in the illustration, and if DNODEs did live in
such an array then they could not be atomically updated in-place.  If the
blocks in question were actually filled with an array of block pointers
pointing to single sector-sized blocks that each held a DNODE, then this would
account for the 1024 entries per 128KB block shown, since block pointers are
128 bytes (not the 512 bytes of a DNODE), but in this case wouldn't such 128KB
blocks be considered to be indirect block pointers, forcing the dn_nlevels
field shown in the object set DNODE at the top left of Illustration 12 to be
2, instead of the 1 that's there?

I'm further confused by the illustration's use of dotted lines to project the
contents of a structure field (as seen in the projection of the metadnode
field of the objset_phys_t structure found at the top of the picture) and
arrows to represent pointers (as seen in the projection of the block pointer
array of the DMU_OT_DNODE type dnode, also at the top of the picture), but
the blocks pointed to by these block pointers seem to actually contain
instances of DNODEs (as seen from the projection of one of these instances in
the lower part of the picture).  Should this projection be replaced by a
pointer to the lower DNODE?
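Not related to the projection question itself, but zdb can dump the actual
uberblock and per-object dnode detail from a real pool, which may help when
checking the spec's illustrations against what is actually on disk (the pool
and dataset names below are placeholders, and zdb's options vary somewhat
between builds):

    # display the pool's uberblock (more u's gives more detail)
    zdb -uuu tank

    # display per-object dnode detail (type, levels, block pointers) for a
    # dataset; more d's gives more detail
    zdb -dddd tank/fs1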
 
 


[zfs-discuss] ZFS limits on zpool snapshots

2007-02-01 Thread Bill Moloney
The ZFS On-Disk Specification and other ZFS documentation describe the labeling
scheme used for the vdevs that comprise a ZFS pool.  A label entry contains,
among other things, an array of uberblocks, one of which will point to the
active object set of the pool it is a part of at a given instant (according to
documentation, the active uberblock for a given pool could be located in the
uberblock array of any vdev participating in the pool at a given instant, and
is subject to relocation from vdev to vdev as the uberblock for the pool is
recreated in an update).

Recreation of the active uberblock would occur, for example, if we took a
snapshot of the pool and changes were then made anywhere in the pool.  Since a
new uberblock is required in this snapshot scenario, and since it appears that
the uberblocks are treated as a kind of circular list across vdevs, it seems
to me that the number of available snapshots we could have of a pool at any
given instant would be strictly limited to the number of available uberblocks
in the vdevs of the pool (128 uberblocks per vdev, if I have that straight).
Is this truly the case, or am I missing something here?
 
 