[zfs-discuss] scrub percentage complete decreasing, but without snaps.

2007-12-14 Thread Ian Collins
I've seen the problems with bug 6343667, but I haven't seen the problem
I have at the moment.

I started a scrub of a b72 system that doesn't have any recent snapshots
(none since the last scrub) and the % complete is cycling:

 scrub: scrub in progress, 69.08% done, 0h13m to go
 scrub: scrub in progress, 46.63% done, 0h28m to go
 scrub: scrub in progress, 6.36% done, 1h37m to go
 scrub: scrub in progress, 2.09% done, 1h11m to go
 scrub: scrub in progress, 0.02% done, 33h17m to go
 scrub: scrub in progress, 0.00% done, 44h39m to go
 scrub: scrub in progress, 0.00% done, 43h17m to go
 scrub: scrub in progress, 0.00% done, 35h6m to go
 scrub: scrub in progress, 1.97% done, 1h6m to go
 scrub: scrub in progress, 4.16% done, 1h21m to go
 scrub: scrub in progress, 3.91% done, 1h15m to go
 scrub: scrub in progress, 1.62% done, 1h10m to go
 scrub: scrub in progress, 0.41% done, 2h6m to go
 scrub: scrub in progress, 0.02% done, 31h18m to go
config:

        NAME        STATE     READ WRITE CKSUM
        export      ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c3d0    ONLINE       0     0     0
            c1d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c4d0    ONLINE       0     0     0
            c6d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c7d0    ONLINE       0     0     0
            c8d0    ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c9d0    ONLINE       0     0     0
            c10d0   ONLINE       0     0     0

errors: No known data errors



Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-14 Thread can you guess?
 On Dec 14, 2007 1:12 AM, can you guess? [EMAIL PROTECTED] wrote:
   yes.  far rarer and yet home users still see them.

  I'd need to see evidence of that for current hardware.

 What would constitute evidence?  Do anecdotal tales from home users
 qualify?  I have two disks (and one controller!) that generate several
 checksum errors per day each.

I assume that you're referring to ZFS checksum errors rather than to transfer 
errors caught by the CRC resulting in retries.

If so, then the next obvious question is, what is causing the ZFS checksum 
errors?  And (possibly of some help in answering that question) is the disk 
seeing CRC transfer errors (which show up in its SMART data)?

If the disk is not seeing CRC errors, then the likelihood that data is being 
'silently' corrupted as it crosses the wire is negligible (1 in 65,536 if 
you're using ATA disks, given your correction below, else 1 in 4.3 billion for 
SATA).  Controller or disk firmware bugs have been known to cause otherwise 
undetected errors (though I'm not familiar with any recent examples in normal 
desktop environments - e.g., the CERN study discussed earlier found a disk 
firmware bug that seemed only activated by the unusual demands placed on the 
disk by a RAID controller, and exacerbated by that controller's propensity just 
to ignore disk time-outs).  So, for that matter, have buggy file systems.  
Flaky RAM can result in ZFS checksum errors (the CERN study found correlations 
there when it used its own checksum mechanisms).
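
(For the arithmetic behind those odds:  they are just the undetected-error
rate of an n-bit CRC under the usual uniform-error assumption, 2^-n - so
2^-16 = 1/65,536 and 2^-32 = 1/4,294,967,296, roughly 1 in 4.3 billion.)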

  I've also seen intermittent checksum fails that go away once all the
  cables are wiggled.

Once again, a significant question is whether the checksum errors are 
accompanied by a lot of CRC transfer errors.  If not, that would strongly 
suggest that they're not coming from bad transfers (and while they could 
conceivably be the result of commands corrupted on the wire, so much more data 
is transferred compared to command bandwidth that you'd really expect to see 
data CRC errors too if commands were getting mangled).  When you wiggle the 
cables, other things wiggle as well (I assume you've checked that your RAM is 
solidly seated).

On the other hand, if you're getting a whole bunch of CRC errors, then with 
only a 16-bit CRC it's entirely conceivable that a few are sneaking by 
unnoticed.

 
   Unlikely, since transfers over those connections have been protected
   by 32-bit CRCs since ATA busses went to 33 or 66 MB/sec. (SATA has
   even stronger protection)

  The ATA/7 spec specifies a 32-bit CRC (older ones used a 16-bit CRC) [1].

Yup - my error:  the CRC was indeed introduced in ATA-4 (33 MB/sec. version), 
but was only 16 bits wide back then.

  The serial ATA protocol also specifies 32-bit CRCs beneath 8b/10b
  coding (1.0a p. 159)[2].  That's not much stronger at all.

The extra strength comes more from its additional coverage (commands as well as 
data).

- bill
 
 


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-14 Thread Casper . Dik

...
though I'm not familiar with any recent examples in normal desktop environments



One example found during early use of zfs in Solaris engineering was
a system with a flaky power supply.

It seemed to work just fine with ufs but when zfs was installed the
sata drives started to show many ZFS checksum errors.

After replacing the power supply, the system did not detect any more
errors.

Flaky power supplies are an important contributor to PC unreliability;
they also tend to fail a lot in various ways.

Casper



[zfs-discuss] zfs snapshot leaking data ?

2007-12-14 Thread Guy
Hello ZFS gurus,


I've been using a ZFS server for about one year now (for rsync-based disk
backup purposes).

The process is quite simple:
I back up each filesystem using rsync.
After each filesystem backup, I take a zfs snapshot to freeze the saved
data read-only.
So I end up with a zfs snapshot for each backup set (one per day).
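
For illustration, a minimal sketch of such a loop (the pool, dataset and
host names are invented - substitute your own):

#!/bin/sh
# Hypothetical backup loop: rsync each filesystem, then freeze the
# result read-only with a snapshot named after the backup date.
DATE=`date +%Y-%m-%d`
for fs in home var opt; do
    rsync -a --delete client:/$fs/ /tank/backup/$fs/
    zfs snapshot tank/backup/$fs@$DATE
done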



When I do a zfs list -r, I can see all the snapshots with the size occupied by
each snapshot, something proportional to the number of disk blocks that
changed since the previous snapshot.

I'm surprised to see that the last snapshot is never empty when the snapshot is 
taken automatically by the backup script.
But if I take a snapshot several hours after the backup script has run, the 
snapshot size is 0.



Is there some data missing in the snapshot if I take it right after writing to
the filesystem?
Should I wait for some time, so that the zfs buffer cache is written to disk?
If so, how long?


Has anyone experienced this kind of symptom?

Thank you for your help.
 
 


Re: [zfs-discuss] Trial x4500, zfs with NFS and quotas.

2007-12-14 Thread Shawn Ferry

On Dec 14, 2007, at 12:27 AM, Jorgen Lundman wrote:



 Shawn Ferry wrote:
  Jorgen,

  You may want to try running 'bootadm update-archive'

  Assuming that your boot-archive problem is an out of date boot-archive
  message at boot and/or doing a clean reboot to let the system try to
  write an up to date boot-archive.

 Yeah, it is remembering to do so after something has changed that's
 hard. In this case, I had to break the mirror to install OpenSolaris.
 (shame that the CD/DVD, and miniroot, doesn't have an md driver).

 It would be tempting to add the bootadm update-archive to the boot
 process, as I would rather have it come up half-assed, than not come
 up at all.

It is part of the shutdown process, you just need to stop crashing :)





[zfs-discuss] JBOD performance

2007-12-14 Thread Frank Penczek
Hi all,

we are using the following setup as file server:

---
# uname -a
SunOS troubadix 5.10 Generic_120011-14 sun4u sparc SUNW,Sun-Fire-280R

# prtconf -D
System Configuration:  Sun Microsystems  sun4u
Memory size: 2048 Megabytes
System Peripherals (Software Nodes):

SUNW,Sun-Fire-280R (driver name: rootnex)
scsi_vhci, instance #0 (driver name: scsi_vhci)
packages
SUNW,builtin-drivers
deblocker
disk-label
terminal-emulator
obp-tftp
SUNW,debug
dropins
kbd-translator
ufs-file-system
chosen
openprom
client-services
options, instance #0 (driver name: options)
aliases
memory
virtual-memory
SUNW,UltraSPARC-III+
memory-controller, instance #0 (driver name: mc-us3)
SUNW,UltraSPARC-III+
memory-controller, instance #1 (driver name: mc-us3)
pci, instance #0 (driver name: pcisch)
ebus, instance #0 (driver name: ebus)
flashprom
bbc
power, instance #0 (driver name: power)
i2c, instance #0 (driver name: pcf8584)
dimm-fru, instance #0 (driver name: seeprom)
dimm-fru, instance #1 (driver name: seeprom)
dimm-fru, instance #2 (driver name: seeprom)
dimm-fru, instance #3 (driver name: seeprom)
nvram, instance #4 (driver name: seeprom)
idprom
i2c, instance #1 (driver name: pcf8584)
cpu-fru, instance #5 (driver name: seeprom)
temperature, instance #0 (driver name: max1617)
cpu-fru, instance #6 (driver name: seeprom)
temperature, instance #1 (driver name: max1617)
fan-control, instance #0 (driver name: tda8444)
motherboard-fru, instance #7 (driver name: seeprom)
ioexp, instance #0 (driver name: pcf8574)
ioexp, instance #1 (driver name: pcf8574)
ioexp, instance #2 (driver name: pcf8574)
fcal-backplane, instance #8 (driver name: seeprom)
remote-system-console, instance #9 (driver name: seeprom)
power-distribution-board, instance #10 (driver name: seeprom)
power-supply, instance #11 (driver name: seeprom)
power-supply, instance #12 (driver name: seeprom)
rscrtc
beep, instance #0 (driver name: bbc_beep)
rtc, instance #0 (driver name: todds1287)
gpio, instance #0 (driver name: gpio_87317)
pmc, instance #0 (driver name: pmc)
parallel, instance #0 (driver name: ecpp)
rsc-control, instance #0 (driver name: su)
rsc-console, instance #1 (driver name: su)
serial, instance #0 (driver name: se)
network, instance #0 (driver name: eri)
usb, instance #0 (driver name: ohci)
scsi, instance #0 (driver name: glm)
disk (driver name: sd)
tape (driver name: st)
sd, instance #12 (driver name: sd)
 ...
ses, instance #29 (driver name: ses)
ses, instance #30 (driver name: ses)
scsi, instance #1 (driver name: glm)
disk (driver name: sd)
tape (driver name: st)
sd, instance #31 (driver name: sd)
sd, instance #32 (driver name: sd)
...
ses, instance #46 (driver name: ses)
ses, instance #47 (driver name: ses)
network, instance #0 (driver name: ce)
pci, instance #1 (driver name: pcisch)
SUNW,qlc, instance #0 (driver name: qlc)
fp (driver name: fp)
disk (driver name: ssd)
fp, instance #1 (driver name: fp)
ssd, instance #1 (driver name: ssd)
ssd, instance #0 (driver name: ssd)
scsi, instance #0 (driver name: mpt)
disk (driver name: sd)
tape (driver name: st)
sd, instance #0 (driver name: sd)
sd, instance #1 (driver name: sd)
...
ses, instance #14 (driver name: ses)
ses, instance #31 (driver name: ses)
os-io
iscsi, instance #0 (driver name: iscsi)
pseudo, instance #0 (driver name: pseudo)
---

The disks reside in a StorEdge 3320 expansion unit
connected to the machine's SCSI controller card (LSI1030 U320).
We've created a raidz2 pool:

---
# zpool status
  pool: storage_array
 state: ONLINE
 scrub: scrub completed with 0 errors on Wed Dec 12 23:38:36 2007
config:

        NAME           STATE     READ WRITE CKSUM
        storage_array  ONLINE       0     0     0
          raidz2       ONLINE       0     0     0
            c2t8d0     ONLINE       0     0     0
            c2t9d0     ONLINE       0     0     0
            c2t10d0    ONLINE       0     0     0
            c2t11d0    ONLINE       0     0     0
            c2t12d0    ONLINE       0     0     0

errors: No known data errors
---

The throughput when writing from a local disk to the zpool is around
30MB/s, when writing from a client ...

The performance is slightly disappointing. Does anyone have
a similar setup and can anyone share some figures?
Any pointers to possible improvements are greatly appreciated.

Re: [zfs-discuss] ZFS with array-level block replication (TrueCopy, SRDF, etc.)

2007-12-14 Thread Jim Dunham
Steve,

 I have a couple of questions and concerns about using ZFS in an  
 environment where the underlying LUNs are replicated at a block  
 level using products like HDS TrueCopy or EMC SRDF.  Apologies in  
 advance for the length, but I wanted the explanation to be clear.

 (I do realise that there are other possibilities such as zfs send/recv
 and there are technical and business pros and cons for the various
 options. I don't want to start a 'which is best' argument :) )

 The CoW design of ZFS means that it goes to great lengths to always  
 maintain on-disk self-consistency, and ZFS can make certain  
 assumptions about state (e.g not needing fsck) based on that.  This  
 is the basis of my questions.

 1) First issue relates to the überblock.  Updates to it are assumed  
 to be atomic, but if the replication block size is smaller than the  
 überblock then we can't guarantee that the whole überblock is  
 replicated as an entity.  That could in theory result in a corrupt  
 überblock at the
 secondary.

 Will this be caught and handled by the normal ZFS checksumming? If  
 so, does ZFS just use an alternate überblock and rewrite the  
 damaged one transparently?

 2) Assuming that the replication maintains write-ordering, the  
 secondary site will always have valid and self-consistent data,  
 although it may be out-of-date compared to the primary if the  
 replication is asynchronous, depending on link latency, buffering,  
 etc.

 Normally most replication systems do maintain write ordering,
 *except* for one specific scenario.  If the replication is
 interrupted, for example secondary site down or unreachable due to
 a comms problem, the primary site will keep a list of changed  
 blocks.  When contact between the sites is re-established there  
 will be a period of 'catch-up' resynchronization.  In most, if not  
 all, cases this is done on a simple block-order basis.  Write-ordering
 is lost until the two sites are once again in sync and routine
 replication restarts.

 I can see this as having major ZFS impact.  It would be possible
 for intermediate blocks to be replicated before the data blocks  
 they point to, and in the worst case an updated überblock could be  
 replicated before the block chains that it references have been  
 copied.  This breaks the assumption that the on-disk format is  
 always self-consistent.

For most implementations of resynchronization, not only are changes
resilvered on a block-ordered basis, resynchronization is also done
in a single pass over the volume(s). To address the fact that
resynchronization happens while additional changes are also being
replicated, the concept of a resynchronization point is kept. As this
resynchronization point traverses the volume from beginning to end,
I/Os occurring before or at this point need to be replicated inline,
whereas I/Os occurring after this point need to be marked so that they
will be replicated later in block order. You are quite correct that
the data is not consistent.

 If a disaster happened during the 'catch-up', and the partially- 
 resynchronized LUNs were imported into a zpool at the secondary  
 site, what would/could happen? Refusal to accept the whole zpool?  
 Rejection just of the files affected? System panic? How could  
 recovery from this situation be achieved?

The state of the partially-resynchronized LUNs is much worse than
you know. During active resynchronization, the remote volume contains
a mixture of prior write-order consistent data, resilvered block-order
data, plus new replicated data. Essentially the partially-resynchronized
LUNs are totally inconsistent until such a time as the single pass over
all data is 100% complete.

For some, but not all, replication software, if the 'catch-up'
resynchronization fails, read access to the LUNs should be prevented,
or at least read access while the LUNs are configured as remote
mirrors. Availability Suite's Remote Mirror software (SNDR) marks such
volumes as needing synchronization and fails all application read and
write I/Os.

 Obviously all filesystems can suffer with this scenario, but ones  
 that expect less from their underlying storage (like UFS) can be  
 fscked, and although data that was being updated is potentially  
 corrupt, existing data should still be OK and usable.  My concern  
 is that ZFS will handle this scenario less well.

 There are ways to mitigate this, of course, the most obvious being  
 to take a snapshot of the (valid) secondary before starting resync,  
 as a fallback.  This isn't always easy to do, especially since the  
 resync is usually automatic; there is no clear trigger to use for  
 the snapshot. It may also be difficult to synchronize the snapshot  
 of all LUNs in a pool. I'd like to better understand the risks/ 
 behaviour of ZFS before starting to work on mitigation strategies.
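
(As an aside, with Availability Suite that fallback snapshot could be
taken with the Point-in-Time Copy facility before resync is allowed to
start.  A rough sketch only - the volume paths are invented and the exact
invocation should be checked against the AVS documentation:

iiadm -e ind /dev/rdsk/c1t1d0s0 /dev/rdsk/c1t2d0s0 /dev/rdsk/c1t3d0s0

where the arguments are the master, shadow and bitmap volumes.)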

Since Availability Suite is both Remote Mirroring and 

[zfs-discuss] LUN configuration for disk-based backups

2007-12-14 Thread Andrew Chace
Hello,

We have a StorageTek FLX280 (very similar to a 6140) with 16 750 GB SATA drives 
that we would like to use for disk-based backups. I am trying to make an 
(educated) guess at what the best configuration for the LUNs on the FLX280 
might be. 

I've read, or at least skimmed, most of the ZFS Best Practices Guide over at 
solarisinternals.com, which has some great information; however, I still do not 
feel like I have a good understanding of the interaction between ZFS and a disk 
array. 

More specifically, I am concerned about the number of IOPS that each drive 
and/or LUN will be able to handle. Seagate lists an average seek time of 9ms, 
and an average rotational latency of 4.16ms for these drives. By my math, each 
drive should be capable of 76 IOPS in a worst case scenario; i.e. completely 
random I/O. 
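
(That figure is just the reciprocal of the per-operation service time:
1000 ms / (9 ms seek + 4.16 ms rotational latency) = 1000 / 13.16, or
about 76 random IOPS per drive.)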

These drives support native command queuing, and the controllers on the FLX280 
have a battery-backed cache, so I would _assume_ that they are also capable of 
reordering I/O ops to improve throughput to the disks. 

So, the question is whether or not worst-case IOPS are even relevant. If my 
assumptions about the controllers on the FLX280 are correct (documentation?), 
then it seems like we could use RAID-Z to get the throughput we're looking for. 
If not, we may have to go with RAID-1. 

Anyone have any thoughts on this?
 
 


Re: [zfs-discuss] JBOD performance

2007-12-14 Thread Richard Elling
Frank Penczek wrote:

 The performance is slightly disappointing. Does anyone have
 a similar setup and can anyone share some figures?
 Any pointers to possible improvements are greatly appreciated.

   

Use a faster processor or change to a mirrored configuration.
raidz2 can become processor bound in the Reed-Solomon calculations
for the 2nd parity set.  You should be able to see this in mpstat, and
at a coarser grain in vmstat.
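
For example, while the write load is running (sustained high %sys
concentrated in the kernel would point at the parity calculations):

# per-CPU utilization at 5-second intervals
mpstat 5
# coarser, system-wide view
vmstat 5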
 -- richard



Re: [zfs-discuss] JBOD performance

2007-12-14 Thread Louwtjie Burger
 The throughput when writing from a local disk to the
 zpool is around 30MB/s, when writing from a client

Err.. sorry, the internal storage would be good old 1Gbit FCAL disks @
10K rpm. Still, not the fastest around ;)


[zfs-discuss] Bugid 6535160

2007-12-14 Thread Vincent Fox
So does anyone have any insight on BugID 6535160?

We have verified on a similar system that ZFS shows big latency in the
filebench varmail test.

We formatted the same LUN with UFS and latency went down from 300 ms to
1-2 ms.

http://sunsolve.sun.com/search/document.do?assetkey=1-1-6535160-1

We run Solaris 10u4 on our production systems, and I don't see any
indication of a patch for this.

I'll try downloading a recent Nevada build, load it on the same system,
and see if the problem has indeed vanished post snv_71.
 
 


Re: [zfs-discuss] LUN configuration for disk-based backups

2007-12-14 Thread Al Hopper
On Fri, 14 Dec 2007, Andrew Chace wrote:

[ reformatted  ]
 Hello,

 We have a StorageTek FLX280 (very similar to a 6140) with 16 750 GB 
 SATA drives that we would like to use for disk-based backups. I am 
 trying to make an (educated) guess at what the best configuration 
 for the LUNs on the FLX280 might be.

 I've read, or at least skimmed, most of the ZFS Best Practices 
 Guide over at solarisinternals.com, which has some great 
 information; however, I still do not feel like I have a good 
 understanding of the interaction between ZFS and a disk array.

 More specifically, I am concerned about the number of IOPS that 
 each drive and/or LUN will be able to handle. Seagate lists an 
 average seek time of 9ms, and an average rotational latency of 
 4.16ms for these drives. By my math, each drive should be capable of 
 76 IOPS in a worst case scenario; i.e. completely random I/O.

 These drives support native command queuing, and the controllers on 
 the FLX280 have a battery-backed cache, so I would _assume_ that 
 they are also capable of reordering I/O ops to improve throughput 
 to the disks.

 So, the question is whether or not worst-case IOPS are even 
 relevant. If my assumptions about the controllers on the FLX280 are 
 correct (documentation?), then it seems like we could use RAID-Z to 
 get the throughput we're looking for. If not, we may have to go with 
 RAID-1.

 Anyone have any thoughts on this?

Since ZFS makes it so quick/easy to create storage pools and 
filesystems, the simplest way to determine your optimum config is to 
conduct a set of experiments - using your data and your applications.
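
For instance, something along these lines (the device names are
placeholders - substitute your own):

# build a candidate raidz pool, try the workload, tear it down
zpool create testpool raidz c1t1d0 c1t2d0 c1t3d0 c1t4d0
zfs create testpool/backup
zfs set compression=on testpool/backup
# ... run a representative backup job here and measure ...
zpool destroy testpool
# then repeat with, say, mirrored pairs and compare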

Bear in mind that no one ZFS config is ideal for every user data 
application scenario - you may wish to consider 2 or more storage 
pools with different configurations that will be the best fit for your 
requirements - given that you may have several different data sets 
with different characteristics.  Then there is ZFS compression - which 
might really help if your data is highly compressible.

Also ensure that you have sufficient network bandwidth into the ZFS 
backup server.

Regards,

Al Hopper  Logical Approach Inc, Plano, TX.  [EMAIL PROTECTED]
Voice: 972.379.2133 Fax: 972.379.2134  Timezone: US CDT
OpenSolaris Governing Board (OGB) Member - Apr 2005 to Mar 2007
http://www.opensolaris.org/os/community/ogb/ogb_2005-2007/
Graduate from sugar-coating school?  Sorry - I never attended! :)


Re: [zfs-discuss] Bugid 6535160

2007-12-14 Thread Neil Perrin
Vincent Fox wrote:
 So does anyone have any insight on BugID 6535160?
 
 We have verified on a similar system, that ZFS shows big latency in filebench 
 varmail test.
 
 We formatted the same LUN with UFS and latency went down from 300 ms to 1-2 
 ms.

This is such a big difference it makes me think something else is going on.
I suspect one of two possible causes:

A) The disk write cache is enabled and volatile. UFS knows nothing of
   write caches and requires the write cache to be disabled, otherwise
   corruption can occur.
B) The write cache is non-volatile, but ZFS hasn't been configured
   to stop flushing it (set zfs:zfs_nocacheflush = 1).
   Note, ZFS enables the write cache and will flush it as necessary.
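
For case B, the tunable goes in /etc/system and takes effect on the next
reboot - a sketch, safe only if the cache really is non-volatile:

* stop ZFS issuing cache-flush requests to the array
set zfs:zfs_nocacheflush = 1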

 
 http://sunsolve.sun.com/search/document.do?assetkey=1-1-6535160-1
 
 We run Solaris 10u4 on our production systems, don't see any indication
 of a patch for this.
 
 I'll try downloading recent Nevada build and load it on same system and see
 if the problem has indeed vanished post snv_71.

Yes please try this. I think it will make a difference but the delta
will be small.

Neil.



Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-14 Thread can you guess?
 
 ...
 though I'm not familiar with any recent examples in normal desktop
 environments

 One example found during early use of zfs in Solaris engineering was
 a system with a flaky power supply.

 It seemed to work just fine with ufs but when zfs was installed the
 sata drives started to show many ZFS checksum errors.

 After replacing the power supply, the system did not detect any more
 errors.

 Flaky power supplies are an important contributor to PC unreliability;
 they also tend to fail a lot in various ways.

Thanks - now that you mention it, I think I remember reading about that here 
somewhere.

But did anyone delve into these errors sufficiently to know that they
were specifically due to controller or disk firmware bugs (since you
seem to be suggesting, by the construction of your response above, that
they were), rather than, say, to RAM errors between checksum generation
and disk access on either reads or writes (if the system in question
didn't have ECC RAM, anyway - though the CERN study found a correlation
between detected RAM errors and silent data corruption even using ECC
RAM)?

Not that the generation of such otherwise undetected errors due to a flaky PSU 
isn't interesting in its own right, but this specific sub-thread was about 
whether poor connections were a significant source of such errors (my comment 
about controller and disk firmware bugs having been a suggested potential 
alternative source) - so identifying the underlying mechanisms is of interest 
as well.

- bill
 
 


Re: [zfs-discuss] Bugid 6535160

2007-12-14 Thread Vincent Fox
 B) The write cache is non volatile, but ZFS hasn't been configured
 to stop flushing it (set zfs:zfs_nocacheflush = 1).

These are a pair of 2540 arrays with dual controllers, so definitely
non-volatile cache.

We set zfs_nocacheflush=1 and that improved things considerably.

ZFS filesystem (2540 arrays):
 fsyncfile3    434ops/s   0.0mb/s 17.3ms/op   977us/op-cpu
 fsyncfile2    434ops/s   0.0mb/s 17.8ms/op   981us/op-cpu

However still not very good compared to UFS.

We turned off ZIL with zil_disable=1 and WOW!
ZFS ZIL disabled:
 fsyncfile3   1148ops/s   0.0mb/s  0.0ms/op   18us/op-cpu
 fsyncfile2   1148ops/s   0.0mb/s  0.0ms/op   18us/op-cpu

Not a good setting to use in production but useful data.

Anyhow, it will take some time to get OpenSolaris onto the system; I will
report back then.
 
 


Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-14 Thread Will Murnane
On Dec 14, 2007 4:23 AM, can you guess? [EMAIL PROTECTED] wrote:
 I assume that you're referring to ZFS checksum errors rather than to transfer 
 errors caught by the CRC resulting in retries.

Correct.

 If so, then the next obvious question is, what is causing the ZFS checksum 
 errors?  And (possibly of some help in answering that question) is the disk 
 seeing CRC transfer errors (which show up in its SMART data)?

The memory is ECC in this machine, and Memtest passed it for five
days.  The disk was indeed getting some pretty lousy SMART scores, but
that doesn't explain the controller issue.  This particular controller
is a SIIG-branded Silicon Image 0680 chipset (which is, apparently, a
piece of junk - if I'd done my homework I would've bought something
else)... but the premise stands.  I bought a piece of consumer-level
hardware off the shelf, it had corruption issues, and ZFS told me
about it when XFS had been silent.

 Once again, a significant question is whether the checksum errors are 
 accompanied by a lot of CRC transfer errors.  If not, that would strongly 
 suggest that they're not coming from bad transfers (and while they could 
 conceivably be the result of commands corrupted on the wire, so much more 
 data is transferred compared to command bandwidth that you'd really expect to 
 see data CRC errors too if commands were getting mangled).  When you wiggle 
 the cables, other things wiggle as well (I assume you've checked that your 
 RAM is solidly seated).

I don't remember offhand if I got CRC errors with the working
controller and drive and bad cabling, sorry.  RAM was solid, as
mentioned earlier.

 The extra strength comes more from its additional coverage (commands as well 
 as data).

Ah, that explains it.

Will


[zfs-discuss] Is round-robin I/O correct for ZFS?

2007-12-14 Thread Gary Mills
I'm testing an Iscsi multipath configuration on a T2000 with two disk
devices provided by a Netapp filer.  Both the T2000 and the Netapp
have two ethernet interfaces for Iscsi, going to separate switches on
separate private networks.  The scsi_vhci devices look like this in
`format':

   1. c4t60A98000433469764E4A413571444B63d0 NETAPP-LUN-0.2-50.00GB
  /scsi_vhci/[EMAIL PROTECTED]
   2. c4t60A98000433469764E4A41357149432Fd0 NETAPP-LUN-0.2-50.00GB
  /scsi_vhci/[EMAIL PROTECTED]

These are concatenated in the ZFS pool.  There are two network paths
to each of the two devices, managed by the scsi_vhci driver.  The pool
looks like this:

  # zpool status
    pool: space
   state: ONLINE
   scrub: none requested
  config:

        NAME                                     STATE     READ WRITE CKSUM
        space                                    ONLINE       0     0     0
          c4t60A98000433469764E4A413571444B63d0  ONLINE       0     0     0
          c4t60A98000433469764E4A41357149432Fd0  ONLINE       0     0     0

  errors: No known data errors

The /kernel/drv/scsi_vhci.conf file, unchanged from the default, specifies:

load-balance=round-robin;

Indeed, when I generate I/O on a ZFS filesystem, I see TCP traffic with
`snoop' on both of the Iscsi ethernet interfaces.  It certainly appears
to be doing round-robin.  The I/O are going to the same disk devices,
of course, but by two different paths.  Is this a correct configuration
for ZFS?  I assume it's safe, but I thought I should check.

-- 
-Gary Mills--Unix Support--U of M Academic Computing and Networking-


[zfs-discuss] Update: zpool kernel panics.

2007-12-14 Thread Edward Irvine
Hi Folks,

Begin forwarded message:

 From: Edward Irvine [EMAIL PROTECTED]
 Date: 12 December 2007 8:44:57 AM
 To: [EMAIL PROTECTED]
 Subject: Fwd: [zfs-discuss] zpool kernel panics.

 FYI ...

 Begin forwarded message:

 From: James C. McPherson [EMAIL PROTECTED]
 Date: 12 December 2007 8:06:51 AM
 To: Edward Irvine [EMAIL PROTECTED]
 Cc: ZFS Discussions zfs-discuss@opensolaris.org
 Subject: Re: [zfs-discuss] zpool kernel panics.
 Reply-To: [EMAIL PROTECTED]


 Hi Eddie,

 Edward Irvine wrote:
 Each time the system crashes, it crashes with the same error  
 message.  This suggests to me that it is zpool corruption rather  
 than faulty  RAM, which is to blame.
 So - is this particular zpool a lost cause?  :\

 It's looking that way to me, but I'm definitely no expert.

  A number of folks have pointed out that this bug may have been fixed
  in a very recent version (nv-77?) of opensolaris.  As a last ditch
  approach, I'm thinking that I could put the current system disks
  (sol10u4) aside, do a quick install of the latest opensolaris, import
  the zpool, do a zpool scrub, export the zpool, shut down, swap in
  the sol10u4 disks, reboot, and import.
  Sigh. Does this approach sound plausible?

 It's definitely worth a shot, as long as you don't have
 to zpool upgrade in order to do it.

OK - this appeared to work:  imported the zpool into opensolaris 77,  
did a zpool scrub - and no kernel panics. Cool!

But - after reimporting the zpool back into Solaris10u4 (where it  
belongs) a zpool scrub still causes a kernel panic - although it  
seemed to take a bit longer to panic. Same error message as before -

panic[cpu1]/thread=2a1015c7cc0:
Dec 15 12:49:35 server unix: [ID 361072 kern.notice] zfs: freeing  
free segment (offset=423713792 size=1024)

Note that opensolaris 77 and Solaris10u4 are on the same physical  
hardware - I'm just booting off different system disks.

 I pulled your crash dump inside Sun, thank you, but I haven't
 had a chance to analyze it, so I've passed the details on to
 more knowledgeable ZFS people.


 James C. McPherson
 --
 Senior Kernel Software Engineer, Solaris
 Sun Microsystems
 http://blogs.sun.com/jmcp    http://www.jmcp.homeunix.com/blog




Sigh. This must definitely be a bug.

Eddie



Re: [zfs-discuss] Nice chassis for ZFS server

2007-12-14 Thread can you guess?
 the next obvious question is, what is causing the ZFS checksum
 errors?  And (possibly of some help in answering that question) is
 the disk seeing CRC transfer errors (which show up in its SMART
 data)?

 The memory is ECC in this machine, and Memtest passed it for five
 days.  The disk was indeed getting some pretty lousy SMART scores,

Seagate ATA disks (if that's what you were using) are notorious for this
in a couple of specific metrics: they ship from the factory that way.
This does not appear to be indicative of any actual problem but rather
of error tabulation which they perform differently than other vendors do
(e.g., I could imagine that they did something unusual in their burn-in
exercising that generated nominal errors, but that's not even
speculation, just a random guess).

 but that doesn't explain the controller issue.  This particular
 controller is a SIIG-branded Silicon Image 0680 chipset (which is,
 apparently, a piece of junk - if I'd done my homework I would've
 bought something else)... but the premise stands.  I bought a piece
 of consumer-level hardware off the shelf, it had corruption issues,
 and ZFS told me about it when XFS had been silent.

Then we've been talking at cross-purposes.  Your original response was to my 
request for evidence that *platter errors that escape detection by the disk's 
ECC mechanisms* occurred sufficiently frequently to be a cause for concern - 
and that's why I asked specifically what was causing the errors you saw (to see 
whether they were in fact the kind for which I had requested evidence).

Not that detecting silent errors due to buggy firmware is useless:  it clearly 
saved you from continuing corruption in this case.  My impression is that in 
conventional consumer installations (typical consumers never crack open their 
case at all, let alone to add a RAID card) controller and disk firmware is 
sufficiently stable (especially for the limited set of functions demanded of 
it) that ZFS's added integrity checks may not count for a great deal (save 
perhaps peace of mind, but typical consumers aren't sufficiently aware of 
potential dangers to suffer from deficits in that area) - but your experience 
indicates that when you stray from that mold ZFS's added protection may 
sometimes be as significant as it was for Robert's mid-range array firmware 
bugs.

And since there indeed was a RAID card involved in the original hypothetical 
situation under discussion, the fact that I was specifically referring to 
undetectable *disk* errors was only implied by my subsequent discussion of disk 
error rates, rather than explicit.

The bottom line appears to be that introducing non-standard components
into the path between RAM and disk has, at least for some specific
subset of those components, the potential to introduce silent errors of
the form that ZFS can catch - quite possibly in considerably greater
numbers than the kinds of undetected disk errors that I was talking
about ever would (that RAID card you were using has a relatively popular
low-end chipset, and Robert's mid-range arrays were hardly
fly-by-night).  So while I'm still not convinced that ZFS offers
significant features in the reliability area compared with other
open-source *software* solutions, the evidence that it may do so in more
sophisticated (but not quite high-end) hardware environments is becoming
more persuasive.

- bill
 
 


Re: [zfs-discuss] Is round-robin I/O correct for ZFS?

2007-12-14 Thread Jonathan Loran

This is the same configuration we use on 4 separate servers (a T2000, 
two X4100s, and a V215).  We use a different iSCSI solution, but we have 
the same multipath config set up with scsi_vhci: dual GigE switches on 
separate NICs on both the server and iSCSI node side.  We suffered from 
the e1000g interface-flapping bug on two of these systems, and one time 
a SAN interface went down to stay (until reboot).  The vhci multipath 
performed flawlessly.  I scrubbed the pools (one of them is 10TB) and no 
errors were found, even though we had heavy IO at the time of the NIC 
failure.  I think this configuration is a good one.
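
If you want to double-check both paths from the Solaris side, mpathadm 
should show two operational paths per LUN - assuming it is available on 
your release; the device below is just the first LUN from Gary's list:

# list multipathed logical units, then inspect one
mpathadm list lu
mpathadm show lu /dev/rdsk/c4t60A98000433469764E4A413571444B63d0s2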

Jon

Gary Mills wrote:
 I'm testing an Iscsi multipath configuration on a T2000 with two disk
 devices provided by a Netapp filer.  Both the T2000 and the Netapp
 have two ethernet interfaces for Iscsi, going to separate switches on
 separate private networks.  The scsi_vhci devices look like this in
 `format':

1. c4t60A98000433469764E4A413571444B63d0 NETAPP-LUN-0.2-50.00GB
   /scsi_vhci/[EMAIL PROTECTED]
2. c4t60A98000433469764E4A41357149432Fd0 NETAPP-LUN-0.2-50.00GB
   /scsi_vhci/[EMAIL PROTECTED]

 These are concatenated in the ZFS pool.  There are two network paths
 to each of the two devices, managed by the scsi_vhci driver.  The pool
 looks like this:

   # zpool status
 pool: space
state: ONLINE
scrub: none requested
   config:
   
           NAME                                     STATE     READ WRITE CKSUM
           space                                    ONLINE       0     0     0
             c4t60A98000433469764E4A413571444B63d0  ONLINE       0     0     0
             c4t60A98000433469764E4A41357149432Fd0  ONLINE       0     0     0
   
   errors: No known data errors

 The /kernel/drv/scsi_vhci.conf file, unchanged from the defaut, specifies:

 load-balance=round-robin;

 Indeed, when I generate I/O on a ZFS filesystem, I see TCP traffic with
 `snoop' on both of the Iscsi ethernet interfaces.  It certainly appears
 to be doing round-robin.  The I/O are going to the same disk devices,
 of course, but by two different paths.  Is this a correct configuration
 for ZFS?  I assume it's safe, but I thought I should check.

   

-- 


Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  [EMAIL PROTECTED]
AST:7731^29u18e3
 




Re: [zfs-discuss] JBOD performance

2007-12-14 Thread Peter Schuller
 Use a faster processor or change to a mirrored configuration.
 raidz2 can become processor bound in the Reed-Soloman calculations
 for the 2nd parity set.  You should be able to see this in mpstat, and to
 a coarser grain in vmstat.

Hmm. Is the OP's hardware *that* slow? (I don't know enough about the Sun 
hardware models)

I have a 5-disk raidz2 (cheap SATA) here on my workstation, which is an X2 
3800+ (i.e., one of the earlier AMD dual-core offerings). Here's me dd:ing to 
a file on FreeBSD on ZFS running on that hardware:

              capacity     operations    bandwidth
pool        used  avail   read  write   read  write
---------  -----  -----  -----  -----  -----  -----
promraid    741G   387G     0    380      0  47.2M
promraid    741G   387G     0    336      0  41.8M
promraid    741G   387G     0    424    510  51.0M
promraid    741G   387G     0    441      0  54.5M
promraid    741G   387G     0    514      0  19.2M
promraid    741G   387G    34    192  4.12M  24.1M
promraid    741G   387G     0    341      0  42.7M
promraid    741G   387G     0    361      0  45.2M
promraid    741G   387G     0    350      0  43.9M
promraid    741G   387G     0    370      0  46.3M
promraid    741G   387G     1    423   134K  51.7M
promraid    742G   386G    22    329  2.39M  10.3M
promraid    742G   386G    28    214  3.49M  26.8M
promraid    742G   386G     0    347      0  43.5M
promraid    742G   386G     0    349      0  43.7M
promraid    742G   386G     0    354      0  44.3M
promraid    742G   386G     0    365      0  45.7M
promraid    742G   386G     2    460  7.49K  55.5M
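
The dd invocation was of this shape (file name and sizes are for
illustration only):

# write a large file through ZFS while watching pool statistics
dd if=/dev/zero of=/promraid/ddtest bs=1m count=20000 &
zpool iostat promraid 1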

At this point the bottleneck looks architectural rather than CPU. None of the 
cores are saturated, and the CPU usage of the ZFS kernel threads is pretty 
low.

I say architectural because writes to the underlying devices are not 
sustained; they drop to almost zero for certain periods (this is more visible 
in iostat -x than it is in the zpool statistics). What I think is happening 
is that ZFS is too late to evict data in the cache, thus blocking the writing 
process. Once a transaction group with a bunch of data gets committed the 
application unblocks, but presumably ZFS waits for a little while before 
resuming writes.

Note that this is also being run on plain hardware; it's not even PCI Express. 
During throughput peaks, but not constantly, the bottleneck is probably the 
PCI bus.

-- 
/ Peter Schuller

PGP userID: 0xE9758B7D or 'Peter Schuller [EMAIL PROTECTED]'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org


