One issue is what we mean by saturation. It's easy to bring a disk to 100%
busy. We need to keep this discussion in the context of a workload. Generally
when people care about streaming throughput of a disk, it's because they are
reading or writing a single large file, and they want to reach
Anything that attempts to append characters to the end of the filename
will run into trouble when the file name is already at NAME_MAX.
One simple solution is to restrict the total length of the name to NAME_MAX,
truncating the original filename as necessary to allow appending. This does
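Something along these lines would do it (a made-up helper, just to illustrate the truncation; error handling omitted, and it assumes the suffix itself is shorter than NAME_MAX):

#include <limits.h>     /* NAME_MAX */
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch: truncate the original name so that
 * name + suffix still fits within NAME_MAX bytes.
 */
void
versioned_name(const char *name, const char *suffix, char out[NAME_MAX + 1])
{
        size_t room = NAME_MAX - strlen(suffix);

        (void) snprintf(out, NAME_MAX + 1, "%.*s%s", (int)room, name, suffix);
}

Of course this silently loses the tail of very long names, which may or may not be acceptable.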
We'll be much better able to help you reach your performance goals
if you can state them as performance goals.
In particular, knowing the latency requirements is important.
Uncompressed HD video runs at 1.5 Gbps; two streams would require 3 Gbps, or
375 MB/sec. The requirement for real-time
What about small random writes? Won't those also require reading from all disks
in RAID-Z to read the blocks for update, where in mirroring only one disk need
be accessed? Or am I missing something?
(It seems like RAID-Z is similar to RAID-3 in its performance characteristics,
since both
The write cache decouples the actual write to disk from the data transfer from
the host. For a streaming operation, this means that the disk can typically
stream data onto tracks with almost no latency (because the cache can aggregate
multiple I/O operations into full tracks which can be
Actually, while Seagate's little white paper doesn't explicitly say so, the
FLASH is used for a write cache and that provides one of the major benefits:
Writes to the disk rarely need to spin up the motor. Probably 90+% of all
writes to disk will fit into the cache in a typical laptop
In RAID-Z, the width of the stripe can vary. For a small block (such as would
hold 512 bytes or less of file data), the stripe will be 1 data block and
either 1 or 2 parity blocks. So a full-stripe write will simply look like
either mirroring (RAIDZ1) or mirror plus one additional ECC block
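As a rough illustration (a simplified sketch, not the actual vdev_raidz layout code; the names are made up):

#include <stddef.h>

/*
 * Simplified sketch: how many disks a single ZFS block touches in a
 * raidz group, assuming 512-byte sectors and ignoring row padding.
 */
int
raidz_cols_used(size_t blocksize, int data_disks, int nparity)
{
        int data_sectors = (int)((blocksize + 511) / 512);
        int data_cols = data_sectors < data_disks ? data_sectors : data_disks;

        return (data_cols + nparity);
}

For a 512-byte block on raidz1 that works out to 1 + 1 = 2 columns, i.e. effectively a mirror, as described above.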
I'd filed 6452505 (zfs create should set permissions on underlying mountpoint)
so that this shouldn't cause problems in the future
So while I'm feeling optimistic :-) we really ought to be able to do this in
two I/O operations. If we have, say, 500K of data to write (including all of
the metadata), we should be able to allocate a contiguous 500K block on disk
and write that with a single operation. Then we update the
One problem with this approach is that software expects /var/mail to be full of
files, not directories, for each user. I don't think you can get the right
semantics out of ZFS for this yet (loopback mounting a file comes to mind, but
breaks down if something tries to delete the user's mailbox
Delivering into $HOME raises some new failure modes if the home directory
servers are NFS mounted, but otherwise often works OK. However, in some cases
it's simply impossible--for instance, in a secure NFS environment where the
home directory can't be mounted without a Kerberos ticket.
I think
Yes, ZFS uses this command very frequently. However, it only does this if the
whole disk is under the control of ZFS, I believe; so a workaround could be to
use slices rather than whole disks when creating a ZFS pool on a buggy device.
Bill,
I realized just now that we're actually sending the wrong variant of
SYNCHRONIZE CACHE, at least for SCSI devices which support SBC-2.
SBC-2 (or possibly even SBC-1, I don't have it handy) added the SYNC_NV bit to
the command. If SYNC_NV is set to 0, the device is required to flush data
Filed as 6462690.
If our storage qualification test suite doesn't yet check for support of this
bit, we might want to get that added; it would be useful to know (and gently
nudge vendors who don't yet support it).
The bigger problem with system utilization for software RAID is the cache, not
the CPU cycles proper. Simply preparing to write 1 MB of data will flush half
of a 2 MB L2 cache. This hurts overall system performance far more than the few
microseconds that XORing the data takes.
(A similar
A determined administrator can always get around any checks and cause problems.
We should do our very best to prevent data loss, though! This case is
particularly bad since simply booting a machine can permanently damage the pool.
And why would we want a pool imported on another host, or not
The biggest problem I see with this is one of observability: if not all
of the data is encrypted yet, what should the encryption property say?
If it says encryption is on, then the admin might think the data is
safe, but if it says it is off, that isn't the truth either, because
some of it may be
True - I'm a laptop user myself. But as I said, I'd assume the whole disk
would fail (it does in my experience).
That's usually the case, but single-block failures can occur as well. They're
rare (check the uncorrectable bit error rate specifications) but if they
happen to hit a critical file,
And if we are still writing to the file systems at that time?
New writes should be done according to the new state (if encryption is being
enabled, all new writes are encrypted), since the goal is that eventually the
whole disk will be in the new state.
The completion percentage should
It would be interesting to have a zfs enabled HBA to offload the checksum
and parity calculations. How much of zfs would such an HBA have to
understand?
That's an interesting question.
For parity, it's actually pretty easy. One can envision an HBA which took a
group of related write commands
I just measured quickly that a 1.2 GHz SPARC can do 400-500 MB/sec
of encoding (time spent in the misnamed function
vdev_raidz_reconstruct) for a 3-disk raid-z group.
Strange, that seems very low.
Ah, I see. The current code loops through each buffer, either copying or XORing
it into the parity.
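Roughly like this, for the single-parity case (a sketch of the idea, not the actual vdev_raidz code):

#include <stdint.h>
#include <string.h>

/*
 * Sketch: copy the first data column into the parity buffer, then XOR
 * each remaining column in, one buffer at a time.
 */
void
raidz_generate_p(uint64_t *p, uint64_t **col, int ncols, int nwords)
{
        int c, i;

        (void) memcpy(p, col[0], nwords * sizeof (uint64_t));
        for (c = 1; c < ncols; c++) {
                for (i = 0; i < nwords; i++)
                        p[i] ^= col[c][i];
        }
}

Looping one buffer at a time means the parity buffer is read and rewritten once per column; combining several columns per pass would cut that memory traffic, which may be part of why the measured number looks low.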
With ZFS, however, the in-between cache is obsolete, as individual disk caches
can be used directly. I also openly question whether even the dedicated RAID
HW is faster than the newest CPUs in modern servers.
Individual disk caches are typically in the 8-16 MB range; for 15 disks, that
gives you
I think there are at least two separate issues here.
The first is that ZFS doesn't support multiple hosts accessing the same pool.
That's simply a matter of telling people. UFS doesn't support multiple hosts,
but it doesn't have any special features to prevent administrators from
*trying* it.
If I'm reading the source correctly, for the $60xx boards, the only supported
revision is $09. Yours is $07, which presumably has some errata with no
workaround, and which the Solaris driver refuses to support. Hope you can
return it ... ?
A quick peek at the Linux source shows a small workaround in place for the 07
revision... Maybe if you file a bug against Solaris to support this revision,
it might be possible to get it added, at least if that's the only issue.
Is this true for single-sector, vs. single-ZFS-block, errors? (Yes, it's
pathological and probably nobody really cares.) I didn't see anything in the
code which falls back on single-sector reads. (It's slightly annoying that the
interface to the block device drivers loses the SCSI error
If you *never* want to import a pool automatically on reboot, you just have to
delete the /etc/zfs/zpool.cache file before the zfs module is loaded.
This could be integrated into SMF.
Or you could always use import -R / create -R for your pool management. Of
course, there's no way to
Mirroring is more efficient for small reads (the size of one ZFS block or less)
because only one disk has to be accessed. Since RAID-Z spreads a ZFS block
across multiple disks, and the data from all disks is required to verify the
checksum, every read accesses every disk.
Mirror read:
1.
Actually, random writes on a RAID-5, while not performing that well because of
the pre-read, don't require a full stripe read (or write). They only require
reading the old data and parity, then writing the new data and parity. This is
quite a bit better than a full stripe, since only two
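A sketch of the parity math for that read-modify-write case (illustrative only; buffers are one stripe unit each):

#include <stdint.h>

/*
 * RAID-5 small-write ("read-modify-write") parity update:
 * new parity = old parity XOR old data XOR new data.
 */
void
raid5_rmw_parity(uint64_t *newp, const uint64_t *oldp,
    const uint64_t *oldd, const uint64_t *newd, int nwords)
{
        int i;

        for (i = 0; i < nwords; i++)
                newp[i] = oldp[i] ^ oldd[i] ^ newd[i];
}

So a small write costs two reads (old data, old parity) and two writes (new data, new parity), independent of the stripe width.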
Hi Mitchell,
I do work for Sun, but I don't consider myself biased towards the slab
allocator or any other Solaris or Sun code. I know we've got plenty of
improvements to make!
That said, your example is not multi-threaded. There are two major performance
issues which come up with a list
ClearCase is a version control system, though — not the same as file versioning.
I think our problem is that we look at FV from different angles. I look
at it from the point of view of people who have NEVER used FV, and you
look at it from the view of people who have ALWAYS used FV.
That's certainly a part of it. It's interesting reading this discussion, as
someone who
People are oriented to their files, not to snapshots.
True, though with NetApp-style snapshots, it's not that difficult to translate
'src/file.c' to '.snapshot/hourly.0/src/file.c' and see what it was like an
hour ago. I imagine that a syntax like '.snapshot/22:20/src/file.c' would also
be
Versioning cannot be automated; taking periodic snapshots != capturing
application state.
But I think we have existence proofs of operating systems which do automate
versioning.
It's true that capturing a new version each time a file has been modified and
closed may not be perfect, but if it
I'm showing my lack of knowledge on this one but I thought SAM-FS could
do something like this. Anyone know for sure?
It's not quite the same, and not out-of-the-box.
SAM-FS has the ability to create an archive copy of files onto disk or tape
when the files are closed after having been
The scan order won't make any difference to ZFS, as it identifies the drives by
a label written to them, rather than by their controller path.
Perhaps someone in ZFS support could analyze the panic to determine the cause,
or look at the disk labels; have you made the core file available to Sun?
The configuration data is stored on the disk devices themselves, at least
primarily.
There is also a copy of the basic configuration data in the file
/etc/zfs/zpool.cache on the boot device. If this file is missing, ZFS will not
automatically import pools, but you can manually import them.
Mirroring will give you the best performance for small write operations.
If you can get by with two disks, I’d divide each of them into two slices, s0
and s1, say. Set up an SVM mirror between d0s0 and d1s0 and use that for your
root. Set up a ZFS mirror between d0s1 and d1s1 and use that for
fsync() should theoretically be better because O_SYNC requires that each
write() include writing not only the data but also the inode and all indirect
blocks back to the disk.
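To make the comparison concrete, here is a rough sketch of the two styles (hypothetical path, error handling omitted):

#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Every write() must reach stable storage before it returns. */
void
write_osync(const char *path, const void *buf, size_t len)
{
        int fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);

        (void) write(fd, buf, len);
        (void) close(fd);
}

/* Writes are buffered; one flush at the end pushes everything out. */
void
write_then_fsync(const char *path, const void *buf, size_t len)
{
        int fd = open(path, O_WRONLY | O_CREAT, 0644);

        (void) write(fd, buf, len);
        (void) fsync(fd);
        (void) close(fd);
}

With many small writes, the second form lets the file system batch the data and metadata updates instead of forcing them out on every call.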
Yes, set the block size to 8K, to avoid a read-modify-write cycle inside ZFS.
As you suggest, using a separate mirror for the transaction log will only be
useful if you're on different disks -- otherwise you will be forcing the disk
head to move back and forth between slices each time you
Most ZFS improvements should be available through patches. Some may require
moving to a future update (for instance, ZFS booting, which may have other
implications throughout the system).
On most systems, you won't see a lot of difference between hardware and
software mirroring.
The benefit of
For what it's worth, close-to-open consistency was added to Linux NFS in the
2.4.20 kernel (late 2002 timeframe). This might be the source of some of the
confusion.
One technique would be to keep a histogram of read/write sizes.
Presumably one would want to do this only during a “tuning phase” after the
file was first created, or when access patterns change. (A shift to smaller
record sizes can be detected by a large proportion of write operations which
No, the reason to try to match recordsize to the write size is so that a small
write does not turn into a large read + a large write. In configurations where
the disk is kept busy, multiplying 8K of data transfer up to 256K hurts.
This is really orthogonal to the cache — in fact, if we had a
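A very rough sketch of the histogram idea mentioned above (made-up names; power-of-two buckets from 512 bytes up to 128K):

#include <stddef.h>

#define WS_BUCKETS      9       /* 512B, 1K, 2K, ... 128K */

static unsigned long ws_hist[WS_BUCKETS];

/* Called for each write during the "tuning phase". */
void
ws_record(size_t len)
{
        size_t sz = 512;
        int b = 0;

        while (sz < len && b < WS_BUCKETS - 1) {
                sz <<= 1;
                b++;
        }
        ws_hist[b]++;
}

Afterwards, the bucket holding the bulk of the writes would suggest a recordsize; a pile-up in the small buckets would be the signal to shrink it.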
Our thinking is that if you want more redundancy than RAID-Z, you should
use RAID-Z with double parity, which provides more reliability and more
usable storage than a mirror of RAID-Zs would.
This is only true if the drives have either independent or identical failure
modes, I think. Consider
I don't see how you can get both end-to-end data integrity and
read avoidance.
Checksum the individual RAID-5 blocks, rather than the entire stripe?
In more detail: Allow the pointer to the block to contain one checksum per
device used (the count will vary if you're using a RAID-Z style
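Something like the following, purely hypothetical, layout is what I have in mind (this is NOT the existing ZFS block pointer):

#include <stdint.h>

#define MAX_COLS        16

/*
 * Hypothetical block pointer variant: one checksum per data column,
 * so a single column can be read and verified without touching the rest.
 */
typedef struct percol_bp {
        uint64_t        pcb_dva;                /* block address */
        uint32_t        pcb_ncols;              /* columns in use */
        uint64_t        pcb_cksum[MAX_COLS];    /* per-column checksums */
} percol_bp_t;

The obvious cost is a much fatter block pointer (or weaker individual checksums, if the existing checksum space were split up).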
A UFS file system has a fixed number of inodes, set when the file system is
created. df can simply report how many of those have been used, and how many
are free.
Most file systems, including ZFS and QFS, allocate inodes dynamically. In this
case, there really isn’t a “number of files free”
The reason that I want to use up the inodes is that we need to test the
behavior in the case where both blocks and inodes are used up. If we only
fill up the blocks, creating an empty file still succeeds.
Pretty much the only way to tell if you've used up all the space available for
file nodes is to
I'd appreciate it if only people who have made changes to the ZFS
codebase found in opensolaris respond further to this thread.
Well. I haven't made changes, but I can read code.
When replacing a device, ZFS internally takes the device being replaced and
creates a mirror between the old and
With zfs, there's this ominous
message saying destroy the filesystem and restore
from tape. That's not so good, for one corrupt
file.
It is strictly correct that to restore the data you'd need
to refer to a backup, in this case.
It is not, however, correct that to restore the data you
No, you still have the hardware problem.
What hardware problem?
There seems to be an unspoken assumption that any checksum error detected by
ZFS is caused by a relatively high error rate in the underlying hardware.
There are at least two classes of hardware-related errors. One class are those
RAID level what? How is anything salvageable if you
lose your only copy? [ ... ]
ZFS does store multiple copies of metadata in a
single vdev, so I
assume we're talking about data here.
I believe we're talking about metadata, as that is the case where ZFS reports
that the pool (as opposed
Bit errors happen. When they do, data is corrupted.
This is rather an oversimplification.
Single-bit errors *on the media* happen relatively frequently. In fact,
multi-bit errors are not too uncommon either. Hence there is a lot of
error-correction data written to the disc media.
The
It is possible to configure ZFS in the way you describe, but your performance
will be limited by the older array.
All mirror writes have to be stored on both arrays before they are considered
complete, so writes will be as slow as the slowest disk or array involved.
ZFS does not currently
And to panic? How can that in any sane way be a good
way to protect the application?
*BANG* - no chance at all for the application to
handle the problem...
I agree -- a disk error should never be fatal to the system; at worst, the file
system should appear to have been forcibly unmounted (and
Is there any command to detect the presence of a ZFS file system on a device?
fstyp is the Solaris command to determine what type of file system may be
present on a disk:
# fstyp /dev/dsk/c0t1d0s6
zfs
When a device is shared between two machines [ ... ]
You can use the same mount/unmount
But it's still not the application's problem to handle the underlying
device failure.
But it is the application's problem to handle an error writing to the file
system -- that's why the file system is allowed to return errors. ;-)
Some applications might not check them, some applications
You specify the mirroring configuration. The top-level vdevs are implicitly
striped. So if you, for instance, request something like
zpool create mypool mirror AA BA mirror AB BB
then you will have a pool consisting of a stripe of two mirrors. Each mirror
will have one copy of its data at each
I think the pool is busted. Even the message printed in your
previous email is bad:
DATASET OBJECT RANGE
15 0 lvl=4294967295 blkid=0
as level is way out of range.
I think this could be from dmu_objset_open_impl().
It sets object to 0 and level to -1 (=
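For what it's worth, 4294967295 is just what a level of -1 looks like when printed as an unsigned 32-bit value:

#include <stdio.h>

int
main(void)
{
        int level = -1;

        (void) printf("lvl=%u\n", (unsigned int)level);  /* lvl=4294967295 */
        return (0);
}

So the "out of range" level is consistent with the -1 sentinel described above.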
Creating an array configuration with one element being a sparse file, then
removing that file, comes to mind, but I wouldn't want to be the first to
attempt it. ;-)
I'm still confused, though. I believe that locking an adaptive mutex will spin
for a short period and then context switch, so they shouldn't be burning
CPU - at least not 0.4s worth!
An adaptive mutex will spin as long as the thread which holds the mutex is on
CPU. If the lock is moderately
This does look like the ATA driver bug rather than a ZFS issue per se.
(For the curious, the reason ZFS triggers this when UFS doesn't is because ZFS
sends a synchronize cache command to the disk, which is not handled in DMA mode
by the controller; and for this particular controller, switching
If your database performance is dominated by sequential reads, ZFS may not be
the best solution from a performance perspective. Because ZFS uses a
write-anywhere layout, any database table which is being updated will quickly
become scattered on the disk, so that sequential read patterns become
NetApp can actually grow their RAID groups, but they recommend adding an entire
RAID group at once instead. If you add a disk to a RAID group on NetApp, I
believe you need to manually start a reallocate process to balance data across
the disks.
Are you looking purely for performance, or for the added reliability that ZFS
can give you?
If the latter, then you would want to configure across multiple LUNs in either
a mirrored or RAID configuration. This does require sacrificing some storage in
exchange for the peace of mind that any
Is there an easy way to determine whether a pool has this fix applied or not?
I think you may be observing that fsync() is slow.
The file will be written, and visible to other processes via the in-memory
cache, before the data has been pushed to disk. vi forces the data out via
fsync, and that can be quite slow when the file system is under load,
especially before a fix
Also note that the UB is written to every vdev (4 per disk), so the
chances of all UBs being corrupted are rather low.
The chances that they're corrupted by the storage system, yes.
However, they are all sourced from the same in-memory buffer, so an undetected
in-memory error (e.g. kernel bug)
We're looking for pure performance.
What will be contained in the LUNs is student user account files that they
will access, and department share files like MS Word documents, Excel files,
and PDFs. There will be no applications on the ZFS storage pools. Does this
help on what strategy
It took manufacturers of SCSI drives some years to get this right. Around 1997
or so we were still seeing drives at my former employer that didn't properly
flush their caches under all circumstances (and had other interesting
behaviours WRT caching).
Lots of ATA disks never did bother to
If the SCSI commands hang forever, then there is nothing that ZFS can
do, as a single write will never return. The more likely case is that
the commands are continually timing out with very long response times,
and ZFS will continue to talk to them forever.
It looks like the sd driver
The implication in what you've written is that ZFS doesn't report an error if
it detects an invalid checksum. Is that correct?
No, sorry I wasn't more clear.
ZFS detects and reports the invalid checksum. If the checksum error occurs on a
directory, this can result in an error being returned
Just to make sure there's no confusion ;-), this error message was added to
'ls' after Solaris 10, and hasn't been backported yet. (Bug 4985395, *ls* does
not report errors from getdents().)
I have a Sun SE 3511 array with 5 x 500 GB SATA-I disks in a RAID 5. This
2 TB logical drive is partitioned into 10 x 200GB slices. I gave 4 of these
slices to a
Solaris 10 U2 machine and added each of them to a concat (non-raid) zpool as
listed below:
This is certainly a supportable
BTW, Jeff's posts to zfs-discuss are being rejected with this message [ ... ]
... while the spam is coming through loud and clear. ;-)
I thought this is what the T10 OSD spec was set up to address. We've already
got device manufacturers beginning to design and code to the spec.
Precisely. The interface to block-based devices forces much of the knowledge
that the file system and application have about access patterns to be
INFORMATION: If a member of this striped zpool becomes unavailable or
develops corruption, Solaris will kernel panic and reboot to protect your
data.
OK, I'm puzzled.
Am I the only one on this list who believes that a kernel panic, instead of
EIO, represents a bug?
Unfortunately there are some cases where the disks lose data;
these cannot be detected by traditional filesystems, but can be with ZFS:
* bit rot: some bits on the disk get flipped (~ 1 in 10^11)
* phantom writes: a disk 'forgets' to write data (~ 1 in 10^8)
* misdirected reads/writes: disk
Do you have more than one snapshot?
If you have a file system a, and create two snapshots [EMAIL PROTECTED] and
[EMAIL PROTECTED], then any space shared between the two snapshots does not
get accounted for anywhere visible. Only once one of those two is deleted, so
that all the space is
Hmm... But, how is my current configuration (1 striped zpool consisting of
4 x 200 GB LUNs from a hardware RAID 5 logical drive) analogous to
taking a single disk, partitioning it into several partitions, then
striping across those partitions, if each 200 GB LUN is presented to
Solaris as a
Good point. Verifying that the new überblock is readable isn’t actually
sufficient, since it might become unreadable in the future. You’d need to wait
for several transaction groups, until the block was unreachable by the oldest
remaining überblock, to be safe in this sense.
On the other
In our recent experience, RAID-5, due to the 2 reads, an XOR calc, and a
write op per write instruction, is usually much slower than RAID-10
(two write ops). Any advice is greatly appreciated.
RAIDZ and RAIDZ2 do not suffer from this malady (the RAID5 write hole).
1. This isn't the write
Is there some reason why a small read on a raidz2 is not statistically very
likely to require I/O on only one device? Assuming a non-degraded pool of
course.
ZFS stores its checksums for RAIDZ/RAIDZ2 in such a way that all disks must be
read to compute and verify the checksum.
This
What happens when a sub-block is missing (single disk failure)? Surely
it doesn't have to discard the entire checksum and simply trust the
remaining blocks?
The checksum is over the data, not the data+parity. So when a disk fails,
the data is first reconstructed, and then the block checksum
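In rough terms, for the single-parity case (a C sketch of the idea, not the actual raid-z code):

#include <stdint.h>
#include <string.h>

/*
 * Rebuild the missing data column by XORing the parity with the
 * surviving data columns; only then is the block checksum computed
 * over the assembled data and compared with the checksum stored in
 * the block pointer.
 */
void
raidz_reconstruct_p(uint64_t *missing, uint64_t **col, int ncols,
    int missing_col, const uint64_t *parity, int nwords)
{
        int c, i;

        (void) memcpy(missing, parity, nwords * sizeof (uint64_t));
        for (c = 0; c < ncols; c++) {
                if (c == missing_col)
                        continue;
                for (i = 0; i < nwords; i++)
                        missing[i] ^= col[c][i];
        }
}

If the checksum still fails after reconstruction, ZFS knows the rebuilt data can't be trusted either.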
DIRECT IO is a set of performance optimisations to circumvent shortcomings of
a given filesystem.
Direct I/O as generally understood (i.e. not UFS-specific) is an optimization
which allows data to be transferred directly between user data buffers and
disk, without a memory-to-memory copy.
If [SSD or Flash] devices become more prevalent, and/or cheaper, I'm curious what
ways ZFS could be made to best take advantage of them?
ways ZFS could be made to bast take advantage of them?
The intent log is a possibility, but this would work better with SSD than
Flash; Flash writes can actually be slower than sequential writes to a real
disk.
It's not about the checksum but about how a fs block is stored in the
raid-z[12] case - it's spread out across all non-parity disks, so in order
to read one fs block you have to read from all disks except the parity
disks.
However, if we didn't need to verify the checksum, we wouldn't
have to read the
Summary (1.8-inch form factor): write: 35 MB/sec, read: 62 MB/sec, IOPS: 7,000
That is on par with a 5400 rpm disk, except for the 100x more small, random
read iops. The biggest issue is the pricing, which will become interestingly
competitive for mortals this year.
$600+ for a 32 GB device
$600+ for a 32 GB device isn't exactly competitive,
though the low-power and random access are
attractive.
Look at previous SSD offerings. $600 is a steal. ;)
This isn't a performance-oriented SSD, since it's using Flash RAM (limited
lifetime, slow writes). It's really meant as a hard
Turnaround question - why *should* ZFS define an underlying
storage arrangement at the filesystem level?
It would be nice to provide it at the directory hierarchy level, but
since file systems in ZFS are cheap, providing it at the file system
level instead might be reasonable. (I say might be
Yes, Anantha is correct; that is the bug id, which could be responsible
for more disk writes than expected.
I believe, though, that this would explain at most a factor of 2 of write
expansion (user data getting pushed to disk once in the intent log, then again
in its final location). If the
To me, hard drives today are as much a commodity item as network cable,
GBICs, NICs, DVD drives, etc.
They are and they aren't. Reliability, particularly in high-heat, high-vibration
environments, can vary quite a bit.
For Sun to charge 4-8 times street price for hard drives that they order just
1. How stable is ZFS?
It's a new file system; there will be bugs. It appears to be well-tested,
though. There are a few known issues; for instance, a write failure can panic
the system under some circumstances. UFS has known issues too
2. Recommended config. Above, I have a fairly
How badly can you mess up a JBOD?
Two words: vibration, cooling.
Three more: power, signal quality.
I've seen even individual drive cases with bad enough signal quality to cause
bit errors.
Often, the spare is up and running but for whatever reason you'll have a
bad block on it and you'll die during the reconstruct.
Shouldn't SCSI/ATA block sparing handle this? Reconstruction should be purely
a matter of writing, so bit rot shouldn't be an issue; or are there cases I'm
not
The affected DIMM? Did you have memory errors before this?
The message you posted looked like ZFS encountered an error writing to the
drive (which could, admittedly, have been caused by bad memory).
In general, your backup software should handle making incremental dumps, even
from a split mirror. What are you using to write data to tape? Are you simply
dumping the whole file system, rather than using standard backup software?
ZFS snapshots use a pure copy-on-write model. If you have a
The space management algorithms in many file systems don't always perform well
when they can't find a free block of the desired size. There's often a cliff
where on average, once the file system is too full, performance drops off
exponentially. UFS deals with this by reserving space explicitly
It turns out that even rather poor prediction accuracy is good enough to make a
big difference (10x) in the failure probability of a RAID system.
See Gordon Hughes and Joseph Murray, Reliability and Security of RAID Storage
Systems and D2D Archives Using SATA Disk Drives, ACM Transactions on
It's possible (if unlikely) that you are only getting checksum errors on
metadata. Since ZFS always internally mirrors its metadata, even on
non-redundant pools, it can recover from metadata corruption which does not
affect all copies. (If there is only one LUN, the mirroring happens at