Re[6]: [zfs-discuss] Re: Re: How much do we really want zpool remove?
Hello Erik,

Wednesday, February 28, 2007, 12:55:24 AM, you wrote:

ET> Honestly, no, I don't consider UFS a modern file system. :-)
ET> It's just not in the same class as JFS for AIX, XFS for IRIX, or even
ET> VxFS.

The point is that the fsck was due to an array corrupting data. IMHO it would hit JFS, XFS or VxFS as badly as UFS, if not worse.

--
Best regards,
 Robert                          mailto:[EMAIL PROTECTED]
                                 http://milek.blogspot.com
[zfs-discuss] Today PANIC :(
Feb 28 05:47:31 server141 genunix: [ID 403854 kern.notice] assertion failed: ss == NULL, file: ../../common/fs/zfs/space_map.c, line: 81
Feb 28 05:47:31 server141 unix: [ID 10 kern.notice]
Feb 28 05:47:31 server141 genunix: [ID 802836 kern.notice] fe8000d559f0 fb9acff3 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55a70 zfs:space_map_add+c2 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55aa0 zfs:space_map_free+22 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55ae0 zfs:space_map_vacate+38 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55b40 zfs:zfsctl_ops_root+2fdbc7e7 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55b70 zfs:vdev_sync_done+2b ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55bd0 zfs:spa_sync+215 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55c60 zfs:txg_sync_thread+115 ()
Feb 28 05:47:31 server141 genunix: [ID 655072 kern.notice] fe8000d55c70 unix:thread_start+8 ()
Feb 28 05:47:31 server141 unix: [ID 10 kern.notice]
Feb 28 05:47:31 server141 genunix: [ID 672855 kern.notice] syncing file systems...
Feb 28 05:47:32 server141 genunix: [ID 733762 kern.notice] 1
Feb 28 05:47:33 server141 genunix: [ID 904073 kern.notice] done

What happened this time? Any suggestions?

thanks,
gino
Re: [zfs-discuss] Why number of NFS threads jumps to the max value?
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6467988

NFSD threads are created on a demand spike (all of them waiting on I/O) but then tend to stick around servicing moderate loads.

-r

Leon Koll wrote:
> Hello, gurus,
> I need your help. During a benchmark test of NFS-shared ZFS file systems, at some point the number of NFS threads jumps to the maximum value, 1027 (NFSD_SERVERS was set to 1024). The latency also grows and the number of IOPS goes down. I've collected the output of echo "::pgrep nfsd | ::walk thread | ::findstack -v" | mdb -k, which can be seen here: http://tinyurl.com/yrvn4z
> Could you please look at it and tell me what's wrong with my NFS server?
> Appreciate it,
> -- Leon
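A couple of quick checks along these lines (a sketch assuming a stock Solaris 10 layout; note that the dcmd pipeline needs quoting when run from a shell):

    # How many kernel threads does nfsd own right now?
    echo "::pgrep nfsd | ::walk thread" | mdb -k | wc -l

    # Capture the per-thread stacks the same way Leon did:
    echo "::pgrep nfsd | ::walk thread | ::findstack -v" | mdb -k > /tmp/nfsd-stacks.txt

    # The configured ceiling on the number of nfsd server threads:
    grep NFSD_SERVERS /etc/default/nfs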
[zfs-discuss] How crazy ZFS import command................
Hi All,

Today I created a zpool (named testpool) on c0t0d0:
#zpool create -m /masthan testpool c0t0d0

Then I wrote some data to the pool:
#cp /usha/* /masthan/

Then I destroyed the zpool:
#zpool destroy testpool

After that I created a UFS file system on the same device, i.e. on c0t0d0:
#newfs -f 2048 /dev/rdsk/c0t0d0s2

Then I mounted it, wrote some data to it, and unmounted it.

But I am still able to see the ZFS file system on c0t0d0. The command
#zpool import -Df testpool
successfully imports the testpool and shows all the files I wrote earlier.

What's wrong with the ZFS import command? Even after creating a new file system on a ZFS disk, it recovers the old ZFS file system. Why is ZFS designed like that? How does it recover the old ZFS FS?

-Masthan
Re: [zfs-discuss] Re: Re: Efficiency when reading the same file blocks
Jeff Davis writes:
> On February 26, 2007 9:05:21 AM -0800 Jeff Davis
> > > But you have to be aware that logically sequential reads do not
> > > necessarily translate into physically sequential reads with zfs.
> I understand that the COW design can fragment files. I'm still trying to understand how that would affect a database. It seems like that may be bad for performance on single disks due to the seeking, but I would expect that to be less significant when you have many spindles. I've read the following blogs regarding the topic, but didn't find a lot of details:
> http://blogs.sun.com/bonwick/entry/zfs_block_allocation
> http://blogs.sun.com/realneel/entry/zfs_and_databases

Here is my take on this:

DB updates (writes) are mostly governed by the synchronous write code path, which for ZFS means ZIL performance. It's already quite good in that it aggregates multiple updates into few I/Os. Some further improvements are in the works. COW, in general, greatly simplifies the write code path.

DB reads in a transactional workload are mostly random. If the DB is not cacheable, the performance will be that of a head seek no matter what FS is used (since we can't guess in advance where to seek, the COW nature neither helps nor hinders performance).

DB reads in a decision-support workload can benefit from good prefetching (since here we actually know where the next seeks will be).

-r
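One rough way to see how much of a given workload is riding that synchronous (ZIL) path, assuming the fbt provider exposes zfs`zil_commit in the running build (<pool> is a placeholder):

    # Count zil_commit() calls per process for ten seconds...
    dtrace -n 'fbt:zfs:zil_commit:entry { @[execname] = count(); } tick-10s { exit(0); }'

    # ...while watching the pool's physical I/O pattern in another window.
    zpool iostat <pool> 1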
[zfs-discuss] Re: Cluster File System Use Cases
I'm an Oracle DBA and we are doing ASM on Sun with RAC. I am happy with ASM's performance but am interested in clustering. I mentioned to Bob Netherton that if Sun could make it a clustering file system, that would help them enable the grid further. Oracle wrote OCFS2 and gave it to the Linux kernel. Since Solaris is GPL too and CDDL (correct me if I'm wrong), couldn't they take OCFS2 and port it to Solaris? Any chance of adding clustering to ZFS? Just to see it and play with it would be fun. ZFS is open source, so if someone cares to write their own clustering file system, they can : )
[zfs-discuss] Re: Cluster File System Use Cases
Also, Oracle's forums and Sun's forums have the SAME exact look and feel... hmmm. Even the options are exactly the same... weird. Both are from a company called Jive Software that does enterprise forums.
[zfs-discuss] Filesystem usage, quotas, and free space
zfs list -o name,type,used,available,referenced,quota,compressratio

NAME                    TYPE        USED   AVAIL  REFER  QUOTA  RATIO
pool/notes              filesystem  151G   149G   53.2G  300G   1.25x
pool/[EMAIL PROTECTED]  snapshot    48.1G  -      57.7G  -      1.27x
pool/[EMAIL PROTECTED]  snapshot    1.55G  -      55.4G  -      1.26x
pool/[EMAIL PROTECTED]  snapshot    1.50G  -      55.4G  -      1.26x
pool/[EMAIL PROTECTED]  snapshot    7.75G  -      55.6G  -      1.26x

Can someone explain to me how to add up the space used by the snapshots and see why there is 151G actually used? There is only 53G of data in the main filesystem, so I know the snapshots are using the space, but the numbers don't add up right...
[zfs-discuss] Re: Cluster File System Use Cases
On Wed, Feb 28, 2007 at 07:23:44AM -0800, Thomas Roach wrote:
> I'm an Oracle DBA and we are doing ASM on Sun with RAC. I am happy with ASM's performance but am interested in clustering. I mentioned to Bob Netherton that if Sun could make it a clustering file system, that would help them enable the grid further. Oracle wrote OCFS2 and gave it to the Linux kernel. Since Solaris is GPL too and CDDL (correct me if I'm wrong), couldn't they take OCFS2 and port it to Solaris? Any chance of adding clustering to ZFS?

ASM was StorageTek's rebranding of SAM-QFS. SAM-QFS is already a shared (clustering) filesystem. You need to upgrade :) Look for Shared QFS. And yes, we're actively pushing the SAM-QFS code through the open-source process. Here's the first blog entry:

http://blogs.sun.com/samqfs/entry/welcome_to_sam_qfs_weblog

Dean
Re: [zfs-discuss] Today PANIC :(
Gino,

We have seen this before, but only very rarely, and we never got a good crash dump. Coincidentally, we saw it only yesterday on a server here, and are currently investigating it. Did you also get a dump we can access? That would help. If not, can you tell us what ZFS version you were running?

At the moment I'm not sure how you can recover from it. Sorry about this problem. FYI, this is bug:

http://bugs.opensolaris.org/view_bug.do?bug_id=6458218

Neil.

Gino Ruopolo wrote on 02/28/07 02:17:
> Feb 28 05:47:31 server141 genunix: [ID 403854 kern.notice] assertion failed: ss == NULL, file: ../../common/fs/zfs/space_map.c, line: 81
> [panic stack as posted earlier in this thread]
> What happened this time? Any suggestions?
> thanks, gino
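For anyone else who hits this assertion, it costs little to make sure a dump actually gets saved next time; a minimal sketch using the standard Solaris crash-dump tools (the savecore directory below is only an example):

    dumpadm                                  # show the current dump device and savecore directory
    dumpadm -y -s /var/crash/`hostname`      # have savecore write dumps there on reboot after a panic
    savecore -L /var/crash/`hostname`        # or grab a live snapshot of the running kernel
                                             # (needs a dedicated dump device, not swap)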
Re: [zfs-discuss] Filesystem usage, quotas, and free space
> zfs list -o name,type,used,available,referenced,quota,compressratio
>
> NAME                    TYPE        USED   AVAIL  REFER  QUOTA  RATIO
> pool/notes              filesystem  151G   149G   53.2G  300G   1.25x
> pool/[EMAIL PROTECTED]  snapshot    48.1G  -      57.7G  -      1.27x
> pool/[EMAIL PROTECTED]  snapshot    1.55G  -      55.4G  -      1.26x
> pool/[EMAIL PROTECTED]  snapshot    1.50G  -      55.4G  -      1.26x
> pool/[EMAIL PROTECTED]  snapshot    7.75G  -      55.6G  -      1.26x
>
> Can someone explain to me how to add up the space used by the snapshots and see why there is 151G actually used? There is only 53G of data in the main filesystem, so I know the snapshots are using the space, but the numbers don't add up right...

They're not supposed to add up. If data is shared between two or more snapshots, it will not appear in the USED column of any snapshot, but of course it is taking space in the pool. Only data unique to a snapshot is shown in its USED. I believe there's currently no way to view the distribution of non-unique data among the snapshots.

--
Darren Dunham                       [EMAIL PROTECTED]
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                 San Francisco, CA bay area
 < This line left intentionally blank to confuse you. >
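A back-of-the-envelope reconciliation of the listing above (the shared-data figure is inferred arithmetic, not something any tool reports directly):

    # unique to individual snapshots: 48.1 + 1.55 + 1.50 + 7.75 = ~58.9G
    # live data (REFER of pool/notes):                            53.2G
    # total USED of pool/notes:                                    151G
    # => roughly 151 - 53.2 - 58.9 = ~38.9G is data held by two or more
    #    snapshots jointly, charged to the filesystem's USED but to no
    #    single snapshot's USED column.
    zfs list -r -o name,used,referenced pool/notes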
Re: [zfs-discuss] Re: Efficiency when reading the same file blocks
Frank Hofmann writes:
> On Tue, 27 Feb 2007, Jeff Davis wrote:
> > > Given your question, are you about to come back with a case where you are not seeing this?
> > As a follow-up, I tested this on UFS and ZFS. UFS does very poorly: the I/O rate drops off quickly when you add processes while reading the same blocks from the same file at the same time. I don't know why this is, and it would be helpful if someone explained it to me.
>
> UFS readahead isn't MT-aware - it starts trashing when multiple threads perform reads of the same blocks. UFS readahead only works if it's a single thread per file, as the readahead state, i_nextr, is per-inode (not per-thread) state. Multiple concurrent readers trash this for each other, as there's only one per file. To qualify 'trashing': UFS loses track of the access pattern, considers the workload random, and so does not do any readahead.
>
> > ZFS did a lot better. There did not appear to be any drop-off after the first process. There was a drop in I/O rate as I kept adding processes, but in that case the CPU was at 100%. I haven't had a chance to test this on a bigger box, but I suspect ZFS is able to keep the sequential read going at full speed (at least if the blocks happen to be written sequentially).
>
> ZFS caches multiple readahead states - see the leading comment in usr/src/uts/common/fs/zfs/vdev_cache.c in your favourite workspace.

The vdev_cache is where you have the low-level, device-level prefetch (issue an I/O for 8K, read 64K of whatever happens to be under the disk head). dmu_zfetch.c is where the logical prefetching occurs.

-r
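A minimal sketch of the kind of test being described, for anyone who wants to reproduce it; the file path is made up and the timings are only meaningful relative to each other:

    #!/bin/sh
    FILE=/pool/fs/bigfile      # a large, previously written test file
    N=4                        # number of concurrent sequential readers
    i=1
    while [ $i -le $N ]; do
        # each reader streams the same file; /usr/bin/time reports to stderr
        ( time dd if=$FILE of=/dev/null bs=128k ) 2> /tmp/reader.$i.out &
        i=`expr $i + 1`
    done
    wait
    grep real /tmp/reader.*.out    # per-reader elapsed time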
Re: [zfs-discuss] How crazy ZFS import command................
This would occur if /dev/rdsk/c0t0d0s2 is not the same as c0t0d0. Double-check your partition table.
-- richard

dudekula mastan wrote:
> Hi All,
> Today I created a zpool (named testpool) on c0t0d0:
> #zpool create -m /masthan testpool c0t0d0
> Then I wrote some data to the pool:
> #cp /usha/* /masthan/
> Then I destroyed the zpool:
> #zpool destroy testpool
> After that I created a UFS file system on the same device, i.e. on c0t0d0:
> #newfs -f 2048 /dev/rdsk/c0t0d0s2
> Then I mounted it, wrote some data to it, and unmounted it.
> But I am still able to see the ZFS file system on c0t0d0. The command
> #zpool import -Df testpool
> successfully imports the testpool and shows all the files I wrote earlier.
> What's wrong with the ZFS import command? Even after creating a new file system on a ZFS disk, it recovers the old ZFS file system. Why is ZFS designed like that? How does it recover the old ZFS FS?
> -Masthan
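A quick way to check that theory on the system in question (device names are the ones from the original post; needs root):

    # Does slice 2 really cover the same blocks ZFS was given?  With a
    # whole-disk pool, zpool create relabels the disk with EFI and puts
    # the pool on s0.
    prtvtoc /dev/rdsk/c0t0d0s2

    # ZFS keeps four copies of its label, two near the start and two near
    # the end of its device, so a newfs that never touches the tail of the
    # disk can leave importable labels behind.  zdb -l prints whatever
    # copies survive:
    zdb -l /dev/dsk/c0t0d0s0
    zdb -l /dev/dsk/c0t0d0s2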
[zfs-discuss] FAULTED ZFS volume even though it is mirrored
Hi there,

We have been using ZFS for our backup storage since August last year. Overall it's been very good, handling transient data issues and even drop-outs of connectivity to the iSCSI arrays we are using for storage. However, I logged in this morning to discover that the ZFS volume could not be read. In addition, it appears to have marked all drives, mirrors and the volume itself as 'corrupted'. From what I can tell I'm completely unable to perform a zpool import, and the only solution according to Sun is to 'destroy and recreate the volume from a backup'. That's not overly helpful considering this IS our backup source AND the fact we thought we had redundancy covered through the implementation of multiple mirrors.

The arrays themselves are configured as 5 x 880GB RAID5 LUNs per array. These 5 are then mirrored to their matching 5 on the other array.

To be honest, I'm pretty disappointed this occurred. I was of the assumption that by setting up a set of mirrored volumes I would avoid this exact problem. Now, for whatever reason (perhaps triggered by an iSCSI connection timeout), ZFS has decided all devices in the entire array are corrupt. I guess that's why assumptions are the mother of all ..s. :)

While this data isn't mission critical, it IS of significant use to us for historical analysis purposes, so any assistance would be greatly appreciated. I've included a dump of the zpool import response below.

Thanks for your help!

Stuart

[EMAIL PROTECTED] ~]$ zpool import
  pool: ax150
    id: 6217526921542582188
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        ax150                                      FAULTED   corrupted data
          mirror                                   FAULTED   corrupted data
            c5t6006016031E0180032F8E9868E30DB11d0  FAULTED   corrupted data
            c5t6006016071851800B86C8EE05831DB11d0  FAULTED   corrupted data
          mirror                                   FAULTED   corrupted data
            c5t6006016031E01800AC7E34918E30DB11d0  FAULTED   corrupted data
            c5t600601607185180010A65FE75831DB11d0  FAULTED   corrupted data
          mirror                                   FAULTED   corrupted data
            c5t6006016031E0180026057F9B8E30DB11d0  FAULTED   corrupted data
            c5t6006016071851800CA1D94EF5831DB11d0  FAULTED   corrupted data
          mirror                                   FAULTED   corrupted data
            c5t6006016031E018005A9B74A48E30DB11d0  FAULTED   corrupted data
            c5t600601607185180064063BF85831DB11d0  FAULTED   corrupted data
          mirror                                   FAULTED   corrupted data
            c5t6006016031E018003810E7AC8E30DB11d0  FAULTED   corrupted data
            c5t60060160718518009A7926FF5831DB11d0  FAULTED   corrupted data
[EMAIL PROTECTED] ~]$
Re: [zfs-discuss] Re: Re: .zfs snapshot directory in all directories
On 2/27/07, Eric Haycraft [EMAIL PROTECTED] wrote:
> I am no scripting pro, but I would imagine it would be fairly simple to create a script and batch it to make symlinks in all subdirectories.

I've done something similar using NFS aggregation products. The real problem is when you export, especially via CIFS (SMB), from a given directory. Let's take the example of a division-based file tree. A given area of the company, say marketing, has multiple sub-folders: /pool/marketing, /pool/marketing/docs, /pool/marketing/projects, /pool/marketing/users.

Well, Marketing wants Windows access, so you allow shares at any point, including at /pool/marketing/users. Symlinks don't help, and a snapshot mechanism needs to be there at the users subdirectory level. Some would argue to promote /pool/marketing/users into a ZFS filesystem. Well, the other problem arises, in that at least with NFS you need to share per filesystem, and clients must multiple-mount the filesystems (/pool/marketing, /pool/marketing/users, /pool/marketing/docs, etc). Mounting /pool/marketing alone will show you empty directories for users, projects, etc. if further mounting doesn't exist. Yeah... automounts, NFSv4, blah blah :) A lot of setup when all you need is pervasive .snapshot trees similar to NetApp's. I just hope they don't have a bloody patent on something as simple as that to solve this.
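For comparison, the per-filesystem route sketched with the hypothetical names above looks roughly like this on the server and on an NFSv3 client:

    # Each child filesystem inherits sharenfs and becomes its own export,
    # each with its own .zfs/snapshot directory:
    zfs create pool/marketing
    zfs create pool/marketing/docs
    zfs create pool/marketing/users
    zfs set sharenfs=on pool/marketing

    # An NFSv3 client then has to mount every filesystem it wants to see,
    # or the child directories appear empty under the parent mount:
    mount -F nfs server:/pool/marketing       /mnt/marketing
    mount -F nfs server:/pool/marketing/users /mnt/marketing/users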
Re: [zfs-discuss] Re: Re: Efficiency when reading the same file blocks
ZFS Group,

My two cents...

Currently, in my experience, it is a waste of time to try to guarantee the exact location of disk blocks with any FS. A simple exception is bad blocks, where a neighboring block will suffice. Second, current disk controllers have logic that translates, and you can't be sure outside of the firmware where the disk block actually is. Yes, I've written code in this area before. Third, some FSs do a read-modify-write where the write is NOT, NOT, NOT overwriting the original location of the read. Why? For a couple of reasons. One is that the original read may have existed in a fragment. Some do it for FS consistency, to allow the write to become a partial write in some circumstances (e.g. a crash); the second file block location then allows for FS consistency and the ability to recover the original contents. No overwrite. Another reason is that sometimes we are filling a hole within an FS object window from a base address to a new offset.

The ability to concatenate allows us to reduce the number of future seeks and small reads/writes, versus having a slightly longer transfer time for the larger theoretical disk block. Thus, the tradeoff is that we accept that we waste some FS space, we may not fully optimize the location of the disk block, and we have larger single-large-block read and write latencies, but... we seek less, the per-byte overhead is less, we can order our writes so that we again seek less, our writes can be delayed (assuming that we might write multiple times and then commit on close) to minimize the number of actual write operations, we can prioritize our reads over our writes to decrease read latency, etc.

Bottom line: performance may suffer if we do a lot of random small read-modify-writes within FS objects that use a very large disk block. Since the actual CHANGE to the file is small, each small write outside of a delayed-write window will consume at least one disk block. However, some writes are to FS objects that are write-through, and thus each small write will consume a new disk block.

Mitchell Erblich
-

Roch - PAE wrote:
> [Roch's take on DB read and write paths, quoted in full in the earlier message]
> -r
Re: [zfs-discuss] FAULTED ZFS volume even though it is mirrored
That's quite strange. What version of ZFS are you running? What does 'zdb -l /dev/dsk/c5t6006016031E0180032F8E9868E30DB11d0s0' show?

- eric

On Thu, Mar 01, 2007 at 09:31:05AM +1000, Stuart Low wrote:
> [Stuart's report and the 'zpool import' output, quoted in full from the earlier message]

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Re: [zfs-discuss] Re: Re: Efficiency when reading the same file blocks
On 28-Feb-07, at 6:43 PM, Erblichs wrote:
> ZFS Group,
> My two cents...
> Currently, in my experience, it is a waste of time to try to guarantee the exact location of disk blocks with any FS.

? Sounds like you're confusing logical location with physical location, throughout this post. I'm sure Roch meant logical location.

--T

> A simple exception is bad blocks, where a neighboring block will suffice. Second, current disk controllers have logic that translates, and you can't be sure outside of the firmware where the disk block actually is. Yes, I've written code in this area before. Third, some FSs do a read-modify-write where the write is NOT, NOT, NOT overwriting the original location of the read.
> ...
Re: [zfs-discuss] FAULTED ZFS volume even though it is mirrored
Heya,

Firstly, thanks for your help.

> That's quite strange.

You're telling me! :) I like ZFS, I really do, but this has dented my love of it. :-/

> What version of ZFS are you running?

[EMAIL PROTECTED] ~]$ pkginfo -l SUNWzfsu
   PKGINST:  SUNWzfsu
      NAME:  ZFS (Usr)
  CATEGORY:  system
      ARCH:  i386
   VERSION:  11.10.0,REV=2006.05.18.01.46
   BASEDIR:  /
    VENDOR:  Sun Microsystems, Inc.
      DESC:  ZFS libraries and commands
    PSTAMP:  on10-patch-x20060302165447
  INSTDATE:  Nov 28 2006 16:53
   HOTLINE:  Please contact your local service provider
    STATUS:  completely installed
     FILES:       45 installed pathnames
                  14 shared pathnames
                   1 linked files
                  16 directories
                  13 executables
                3612 blocks used (approx)
[EMAIL PROTECTED] ~]$

> What does 'zdb -l /dev/dsk/c5t6006016031E0180032F8E9868E30DB11d0s0' show?

Included below. I don't suppose there's a full manual for zdb?

Stuart

[EMAIL PROTECTED] ~]$ zdb -l /dev/dsk/c5t6006016031E0180032F8E9868E30DB11d0s0
LABEL 0
    version=2
    name='ax150'
    state=0
    txg=2063663
    pool_guid=6217526921542582188
    top_guid=9705179868573891
    guid=10975169994332783304
    vdev_tree
        type='mirror'
        id=0
        guid=9705179868573891
        metaslab_array=13
        metaslab_shift=33
        ashift=9
        asize=944879435776
        children[0]
                type='disk'
                id=0
                guid=10975169994332783304
                path='/dev/dsk/c4t6006016031E0180032F8E9868E30DB11d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                whole_disk=1
                DTL=314
        children[1]
                type='disk'
                id=1
                guid=511925173000616
                path='/dev/dsk/c4t6006016071851800B86C8EE05831DB11d0s0'
                devid='id1,[EMAIL PROTECTED]/a'
                whole_disk=1
                DTL=313
[LABEL 1, LABEL 2 and LABEL 3 repeat the same contents; LABEL 3 is cut off in the archived copy of this message.]
Re: [zfs-discuss] FAULTED ZFS volume even though it is mirrored
The label looks sane. Can you try running:

# dtrace -n vdev_set_state:entry'[EMAIL PROTECTED], args[3], stack()] = count()}'

while executing 'zpool import' and send the output? Can you also send '::dis' output (from 'mdb -k') for the function immediately above vdev_set_state() in the above stacks? I think the function should be vdev_validate(), but I don't remember if it's the same in the ZFS version you have.

- Eric

--
Eric Schrock, Solaris Kernel Development       http://blogs.sun.com/eschrock
Re: [zfs-discuss] Re: Re: Efficiency when reading the same file blocks
Toby Thain,

No, physical location was for the exact location, and logical was for the rest of my info. But what I might not have made clear was the use of fragments. There are two types of fragments: one is the partial use of a logical disk block, and the other, which I was also trying to refer to, is the moving of modified sections of a file. The first kind was well used in the Joy FFS implementation, where an FS and drive tended to have a high cost per byte and were fairly small.

Now, let's make this perfectly clear. If an FS object is large and written somewhat in sequence as a stream of bytes, and then random FS logical blocks or physical blocks are modified, the new FS object will be laid out less sequentially, which CAN decrease read performance. Sorry, I tend to care less about write performance, because writes tend to be async, without threads blocking waiting for their operation to complete. This will happen MOST as the FS fills and less optimal locations are found for the COW blocks. The same problem happens with memory on OSs that support multiple page sizes, where a well-used system may not be able to allocate large pages due to fragmentation. Yes, this is an overloaded term... :)

Thus, FS performance may suffer even if there are just a lot of 1-byte changes to frequently accessed FS objects. If this occurs, either keep a larger FS, clean out the FS more frequently, or back up, clean up, and then restore to get freshly sequential FS objects.

Mitchell Erblich
-

Toby Thain wrote:
> On 28-Feb-07, at 6:43 PM, Erblichs wrote:
> > Currently, in my experience, it is a waste of time to try to guarantee the exact location of disk blocks with any FS.
> ? Sounds like you're confusing logical location with physical location, throughout this post. I'm sure Roch meant logical location.
> --T
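A crude way to watch this effect on ZFS, if anyone wants to experiment (the pool/filesystem name and file path are made up; zdb needs root):

    # Write a file sequentially, then rewrite a few 1MB chunks in place.
    dd if=/dev/zero of=/tank/fs/seqfile bs=1024k count=256
    sync
    for off in 7 42 113 200; do
        dd if=/dev/zero of=/tank/fs/seqfile bs=1024k count=1 \
            oseek=$off conv=notrunc
    done
    sync
    # The ZFS object number equals the inode number, so zdb can dump the
    # block-pointer tree and show where the rewritten (COW'd) blocks landed
    # relative to the original run.
    obj=`ls -i /tank/fs/seqfile | awk '{print $1}'`
    zdb -dddddd tank/fs $obj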
Re: [zfs-discuss] FAULTED ZFS volume even though it is mirrored
Heya,

> The label looks sane. Can you try running:

Not sure if I should be reassured by that, but I'll hold my hopes high. :)

> # dtrace -n vdev_set_state:entry'[EMAIL PROTECTED], args[3], stack()] = count()}'
> while executing 'zpool import' and send the output? Can you also send '::dis' output (from 'mdb -k') for the function immediately above vdev_set_state() in the above stacks? I think the function should be vdev_validate(), but I don't remember if it's the same in the ZFS version you have.

Attached below. Forgive my ignorance, but I don't understand what you're requesting with regard to 'mdb -k'. From what I can see, 'vdev_propagate_state' appears to be the function immediately before vdev_set_state.

Thanks again for your assistance, greatly appreciated.

Stuart

---
[EMAIL PROTECTED] ~]$ dtrace -n vdev_set_state:entry'[EMAIL PROTECTED], args[3], stack()] = count()}' -c 'zpool import'
dtrace: description 'vdev_set_state:entry' matched 1 probe
  pool: ax150
<snip>
dtrace: pid 726 has exited

       40
              zfs`vdev_root_state_change+0x61
              zfs`vdev_propagate_state+0x8f
              zfs`vdev_set_state+0xa8
              zfs`vdev_mirror_state_change+0x32
              zfs`vdev_propagate_state+0x8f
              zfs`vdev_set_state+0xa8
              zfs`vdev_load+0x69
              zfs`vdev_load+0x25
              zfs`vdev_load+0x25
              zfs`spa_load+0x487
              zfs`spa_tryimport+0x87
              zfs`zfs_ioc_pool_tryimport+0x4b
              zfs`zfsdev_ioctl+0x144
              genunix`cdev_ioctl+0x1d
              specfs`spec_ioctl+0x50
              genunix`fop_ioctl+0x1a
              genunix`ioctl+0xac
              unix`_sys_sysenter_post_swapgs+0x139
                1

       32
              zfs`vdev_propagate_state+0xc6
              zfs`vdev_set_state+0xa8
              zfs`vdev_mirror_state_change+0x32
              zfs`vdev_propagate_state+0x8f
              zfs`vdev_set_state+0xa8
              zfs`vdev_load+0x69
              zfs`vdev_load+0x25
              zfs`vdev_load+0x25
              zfs`spa_load+0x487
              zfs`spa_tryimport+0x87
              zfs`zfs_ioc_pool_tryimport+0x4b
              zfs`zfsdev_ioctl+0x144
              genunix`cdev_ioctl+0x1d
              specfs`spec_ioctl+0x50
              genunix`fop_ioctl+0x1a
              genunix`ioctl+0xac
              unix`_sys_sysenter_post_swapgs+0x139
                4

       32
              zfs`vdev_propagate_state+0xc6
              zfs`vdev_set_state+0xa8
              zfs`vdev_mirror_state_change+0x45
              zfs`vdev_propagate_state+0x8f
              zfs`vdev_set_state+0xa8
              zfs`vdev_load+0x69
              zfs`vdev_load+0x25
              zfs`vdev_load+0x25
              zfs`spa_load+0x487
              zfs`spa_tryimport+0x87
              zfs`zfs_ioc_pool_tryimport+0x4b
              zfs`zfsdev_ioctl+0x144
              genunix`cdev_ioctl+0x1d
              specfs`spec_ioctl+0x50
              genunix`fop_ioctl+0x1a
              genunix`ioctl+0xac
              unix`_sys_sysenter_post_swapgs+0x139
                4

       33
              zfs`vdev_root_state_change+0x40
              zfs`vdev_propagate_state+0x8f
              zfs`vdev_set_state+0xa8
              zfs`vdev_mirror_state_change+0x32
              zfs`vdev_propagate_state+0x8f
              zfs`vdev_set_state+0xa8
              zfs`vdev_load+0x69
              zfs`vdev_load+0x25
              zfs`vdev_load+0x25
              zfs`spa_load+0x487
              zfs`spa_tryimport+0x87
              zfs`zfs_ioc_pool_tryimport+0x4b
              zfs`zfsdev_ioctl+0x144
              genunix`cdev_ioctl+0x1d
              specfs`spec_ioctl+0x50
              genunix`fop_ioctl+0x1a
              genunix`ioctl+0xac
              unix`_sys_sysenter_post_swapgs+0x139
                4

       32
              zfs`vdev_load+0x96
              zfs`vdev_load+0x25
              zfs`spa_load+0x487
              zfs`spa_tryimport+0x87
              zfs`zfs_ioc_pool_tryimport+0x4b
              zfs`zfsdev_ioctl+0x144
              genunix`cdev_ioctl+0x1d
              specfs`spec_ioctl+0x50
              genunix`fop_ioctl+0x1a
              genunix`ioctl+0xac
              unix`_sys_sysenter_post_swapgs+0x139
                5

       33
              zfs`vdev_root_state_change+0x40
              zfs`vdev_propagate_state+0x8f
              zfs`vdev_set_state+0xa8
              zfs`vdev_mirror_state_change+0x45
              zfs`vdev_propagate_state+0x8f
              zfs`vdev_set_state+0xa8
              zfs`vdev_load+0x69
              zfs`vdev_load+0x25
              zfs`vdev_load+0x25
              zfs`spa_load+0x487
              zfs`spa_tryimport+0x87
              zfs`zfs_ioc_pool_tryimport+0x4b
              zfs`zfsdev_ioctl+0x144
              genunix`cdev_ioctl+0x1d
              specfs`spec_ioctl+0x50
              genunix`fop_ioctl+0x1a
Re: [zfs-discuss] FAULTED ZFS volume even though it is mirrored
Heya,

> Sorry. Try 'echo vdev_load::dis | mdb -k'. This will give the disassembly for vdev_load() in your current kernel (which will help us pinpoint what vdev_load+0x69 is really doing).

Ahh, thanks for that. Attached.

Stuart

---
[EMAIL PROTECTED] ~]$ echo vdev_load::dis | mdb -k
vdev_load:                      pushq  %rbp
vdev_load+1:                    movq   %rsp,%rbp
vdev_load+4:                    pushq  %r12
vdev_load+6:                    movq   %rdi,%r12
vdev_load+9:                    pushq  %rbx
vdev_load+0xa:                  xorl   %ebx,%ebx
vdev_load+0xc:                  cmpq   $0x0,0x68(%rdi)
vdev_load+0x11:                 je     +0x1e    <vdev_load+0x2f>
vdev_load+0x13:                 xorl   %edx,%edx
vdev_load+0x15:                 movq   0x60(%r12),%rax
vdev_load+0x1a:                 incl   %ebx
vdev_load+0x1c:                 movq   (%rax,%rdx,8),%rdi
vdev_load+0x20:                 call   -0x20    <vdev_load>
vdev_load+0x25:                 movslq %ebx,%rdx
vdev_load+0x28:                 cmpq   0x68(%r12),%rdx
vdev_load+0x2d:                 jb     -0x18    <vdev_load+0x15>
vdev_load+0x2f:                 cmpq   %r12,0x50(%r12)
vdev_load+0x34:                 je     +0x3a    <vdev_load+0x6e>
vdev_load+0x36:                 movq   0x38(%r12),%rax
vdev_load+0x3b:                 movl   0x40(%rax),%r8d
vdev_load+0x3f:                 testl  %r8d,%r8d
vdev_load+0x42:                 jne    +0x7     <vdev_load+0x49>
vdev_load+0x44:                 popq   %rbx
vdev_load+0x45:                 popq   %r12
vdev_load+0x47:                 leave
vdev_load+0x48:                 ret
vdev_load+0x49:                 movq   %r12,%rdi
vdev_load+0x4c:                 call   -0x35c   <vdev_dtl_load>
vdev_load+0x51:                 testl  %eax,%eax
vdev_load+0x53:                 je     -0xf     <vdev_load+0x44>
vdev_load+0x55:                 movq   %r12,%rdi
vdev_load+0x58:                 movl   $0x2,%ecx
vdev_load+0x5d:                 movl   $0x3,%edx
vdev_load+0x62:                 xorl   %esi,%esi
vdev_load+0x64:                 call   +0xb7c   <vdev_set_state>
vdev_load+0x69:                 popq   %rbx
vdev_load+0x6a:                 popq   %r12
vdev_load+0x6c:                 leave
vdev_load+0x6d:                 ret
vdev_load+0x6e:                 cmpq   $0x0,0x20(%r12)
vdev_load+0x74:                 je     +0xe     <vdev_load+0x82>
vdev_load+0x76:                 cmpq   $0x0,0x18(%r12)
vdev_load+0x7c:                 nop
vdev_load+0x80:                 jne    +0x18    <vdev_load+0x98>
vdev_load+0x82:                 movl   $0x2,%ecx
vdev_load+0x87:                 movl   $0x3,%edx
vdev_load+0x8c:                 xorl   %esi,%esi
vdev_load+0x8e:                 movq   %r12,%rdi
vdev_load+0x91:                 call   +0xb4f   <vdev_set_state>
vdev_load+0x96:                 jmp    -0x60    <vdev_load+0x36>
vdev_load+0x98:                 xorl   %esi,%esi
vdev_load+0x9a:                 movq   %r12,%rdi
vdev_load+0x9d:                 call   -0xefd   <vdev_metaslab_init>
vdev_load+0xa2:                 testl  %eax,%eax
vdev_load+0xa4:                 je     -0x6e    <vdev_load+0x36>
vdev_load+0xa6:                 jmp    -0x24    <vdev_load+0x82>
[EMAIL PROTECTED] ~]$
Re: [zfs-discuss] FAULTED ZFS volume even though it is mirrored
Further to this, I've considered doing the following:

1) Doing a zpool destroy on the volume
2) Doing a zpool import -D on the volume

It would appear to me that primarily what has occurred is that one or all of the metadata stores ZFS has created have become corrupt. Will a zpool import -D ignore the metadata and rebuild using some magic foo? I still don't understand how, even if an entire array became broken, these issues were then synced to the backup array.

Confused and bewildered,

Stuart :)

On Wed, 2007-02-28 at 16:55 -0800, Eric Schrock wrote:
> On Thu, Mar 01, 2007 at 10:50:28AM +1000, Stuart Low wrote:
> > Heya,
> > > Sorry. Try 'echo vdev_load::dis | mdb -k'. This will give the disassembly for vdev_load() in your current kernel (which will help us pinpoint what vdev_load+0x69 is really doing).
> > Ahh, thanks for that.
> snip