Re: [zfs-discuss] Zfs send speed. Was: User quota design discussion..
Jorgen Lundman wrote:
> We finally managed to upgrade the production x4500s to Sol 10 10/08 (unrelated to this), but with the hope that it would also make zfs send usable. Exactly how does build 105 translate to Solaris 10 10/08? My current [...]

There is no easy/obvious mapping of Solaris Nevada builds to Solaris 10 update releases. Solaris Nevada started as a branch of S10 after it was released and is the place where new features (RFEs) are developed. For a bug fix or RFE to end up in a Solaris 10 update release, it needs to meet certain criteria. Basically, only those CRs which are found necessary (and this applies to both bugs and features) are backported to S10uX.
Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great
Thanks for the explanation, folks. So if I cannot get Apache/WebDAV to write synchronously (and it does not look like I can), then is it possible to tune the ARC to be more write-buffer heavy? My biggest problem is with very quick spikes in writes periodically throughout the day. If I were able to buffer these better, I would be in pretty good shape. The machines are already (economically) maxed out on RAM at 32 gigs. If I were to add in the SSD L2ARC devices for read caching, can I configure the ARC to give up some of its read caching for more write buffering? Thanks.

Neil Perrin wrote:
> Patrick, the ZIL is only used for synchronous requests like O_DSYNC/O_SYNC and fsync(). Your iozone command must be doing some synchronous writes. All the other tests (dd, cat, cp, ...) do everything asynchronously. That is, they do not require the data to be on stable storage on return from the write. So asynchronous writes get cached in memory (the ARC) and written out periodically (every 30 seconds or less) when the transaction group commits. The ZIL would be heavily used if your system were an NFS server. Databases also do synchronous writes. Neil.
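For anyone following along, here is a minimal sketch (mine, not from the thread; the /tank paths are made up) of the distinction Neil describes - a plain write() returns once the data is in the ARC, while O_DSYNC or fsync() forces it through the ZIL to stable storage first:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            char buf[8192];
            memset(buf, 'x', sizeof (buf));

            /* Asynchronous: returns once the data is cached in the ARC;
             * it reaches disk when the transaction group commits. */
            int afd = open("/tank/async.dat", O_WRONLY | O_CREAT, 0644);
            (void) write(afd, buf, sizeof (buf));
            (void) close(afd);

            /* Synchronous: O_DSYNC sends every write through the ZIL,
             * so the data is on stable storage when write() returns. */
            int sfd = open("/tank/sync.dat",
                O_WRONLY | O_CREAT | O_DSYNC, 0644);
            (void) write(sfd, buf, sizeof (buf));
            (void) fsync(sfd);   /* the explicit route to the same guarantee */
            (void) close(sfd);
            return (0);
    }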
Re: [zfs-discuss] vdev_disk_io_start() sending NULL pointer in ldi_ioctl()
On Thu, 9 Apr 2009, shyamali.chakrava...@sun.com wrote:
> Hi All, I have a corefile where we see a NULL pointer dereference PANIC, as we have (deliberately) sent a NULL pointer for the return value:
>
>     vdev_disk_io_start()
>         error = ldi_ioctl(dvd->vd_lh, zio->io_cmd,
>             (uintptr_t)zio->io_dk_callback, FKIOCTL, kcred, NULL);

Note that it's not just in vdev_disk_io_start() that we pass NULL. It's everywhere - there are four calls to ldi_ioctl() in vdev_disk.c, and they all pass NULL.

> ldi_ioctl() expects the last parameter to be an integer pointer (int *rvalp). I see that in strdoioctl().

I'm curious about your configuration. What is the setup you've got that is going through stream I/O?

Regards, markm
Re: [zfs-discuss] vdev_disk_io_start() sending NULL pointer in ldi_ioctl()
shyamali.chakrava...@sun.com wrote:
> Hi All, I have a corefile where we see a NULL pointer dereference PANIC, as we have (deliberately) sent a NULL pointer for the return value:
>
>     vdev_disk_io_start()
>     ...
>         error = ldi_ioctl(dvd->vd_lh, zio->io_cmd,
>             (uintptr_t)zio->io_dk_callback, FKIOCTL, kcred, NULL);
>
> ldi_ioctl() expects the last parameter to be an integer pointer (int *rvalp). I see that in strdoioctl():
>
>     /*
>      * Set return value.
>      */
>     *rvalp = iocbp->ioc_rval;
>
> The corefile I am analysing has a similar BAD trap while trying to stw %g0, [%i5] (clr [%i5]). Is it a bug?? This is all we do in vdev_disk_io_start(). I would appreciate any feedback on this. regards, --shyamali

This doesn't make sense, as strdoioctl() should only be called on a stream. The normal call path should be to cdev_ioctl() and eventually to sdioctl(). Can you provide the stack?

- George
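(In case it helps, here is a sketch of the sort of defensive change being discussed - purely illustrative, not a committed fix: hand ldi_ioctl() a real int so a driver that unconditionally stores through rvalp cannot take a NULL-pointer trap.)

    /* In vdev_disk_io_start(): pass a scratch int instead of NULL so
     * drivers that blindly do "*rvalp = ..." have somewhere to write. */
    int rval = 0;

    error = ldi_ioctl(dvd->vd_lh, zio->io_cmd,
        (uintptr_t)zio->io_dk_callback, FKIOCTL, kcred, &rval);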
Re: [zfs-discuss] Efficient backup of ZFS filesystems?
On Thu, Apr 09, 2009 at 04:25:58PM +0200, Henk Langeveld wrote:
> Gary Mills wrote:
>> I've been watching the ZFS ARC cache on our IMAP server while the backups are running, and also when user activity is high. The two seem to conflict. Fast response for users seems to depend on their data being in the cache when it's needed. Most of the disk I/O seems to be writes in this situation. However, the backup needs to stat all files and read many of them. I'm assuming that all of this information is also added to the ARC cache, even though it may never be needed again. It must also evict user data from the cache, causing it to be reloaded every time it's needed.
>
> Find out whether you have a problem first. If not, don't worry, but read on. If you do have a problem, add memory or an L2ARC device.

We do have a problem, but not with the backup itself. The backup is slow, but I expect that's just because it's reading a very large number of small files. Our problem is with normal IMAP operations becoming quite slow at times. I'm wondering if the backup is contributing to this problem.

> The ARC was designed to mitigate the effect of any single burst of sequential I/O, but the size of the cache dedicated to more Frequently used pages (the current working set) will still be reduced, depending on the amount of activity on either side of the cache.

That's a nice design, better than a simple cache.

> As the ARC maintains a shadow list of recently evicted pages from both sides of the cache, such pages that are accessed again will then return to the 'Frequent' side of the cache. There will be continuous competition between the 'Recent' and 'Frequent' sides of the ARC (and for convenience, I'm glossing over the existence of 'Locked' pages). Several things might cause pathological behaviour - a backup process might access the same metadata multiple times, causing that data to be promoted to 'Frequent', flushing out application-related data. (ZFS does not differentiate between data and metadata for resource allocation; they all use the same I/O mechanism and cache.)

That might be possible in our case.

> On the other hand, you might just not have sufficient memory to keep most of your metadata in the cache, or the backup process is just too aggressive. Adding memory or an L2ARC device might help.

We've added memory. That did seem to help, although the problem's still there. I assume the L2ARC is not available in Solaris 10.

--
-Gary Mills--Unix Support--U of M Academic Computing and Networking-
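(Regarding Henk's L2ARC suggestion, the mechanics are a single command on releases that support cache devices - a hedged example, with a made-up pool and device name:)

    # add an SSD as an L2ARC (cache) device to an existing pool
    zpool add tank cache c2t0d0
    # it then appears under a "cache" section in:
    zpool status tank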
Re: [zfs-discuss] ZFS Panic
> r == Rince rincebr...@gmail.com writes:

    r> *ZFS* shouldn't panic under those conditions. The disk layer,
    r> perhaps, but not ZFS.

well, yes, but panicking brings down the whole box anyway, so there is no practical difference, just a difference in blame. I would rather say: the fact that redundant ZFS ought to be the best-practice proper way to configure ~all filesystems in the future means that disk drivers in the future ought to expect to have ZFS above them, so panicking when enough drives are still available to keep the pool up isn't okay; it's also not okay to let problems with one drive interrupt access to other drives; and finally, we've still no reasonably practicable consensus on how to deal with timeout problems, like vanishing iSCSI targets, and ATA targets that remain present but take 1000x longer to respond to each command, as ATA disks often do when they're failing and as is, I suspect, well-handled by all the serious hardware RAID storage vendors.

With some chips, writing a good driver has proven (on Linux) to be impossible, or beyond the skill of the person who adopted the chip, or beyond the effort warranted by the chip's interestingness. well, fine, but these things are certainly important enough to document, and on Linux they ARE documented: http://ata.wiki.kernel.org/index.php/SATA_hardware_features It's kind of best-effort, but still it's a lot better than ``all those problems on X4500 were fixed AGES ago, just upgrade'' / ``still having problems'' / ``ok they are all fixed now'' / ``no they're not, still can't hotplug, still no NCQ'' / ``well they are much more stable now.'' / ``can I hotplug? is NCQ working?'' / ...

Note the LSI 1068 IT-mode cards driven by the proprietary 'mpt' driver are supported, by a GPL driver, on Linux, and smartctl works on these cards. but they don't appear on the wiki above, so Linux's list of chip features isn't complete, but it's a start.

    r> As far as it should be concerned, it's equivalent to ejecting
    r> a disk via cfgadm without telling ZFS first, which *IS* a
    r> supported operation.

an interesting point! Either way, though, we're responsible for the whole system. ``Our new handsets have microkernels, which is excellent for reliability! In the future, when there's a bug, it won't crash the whole celfone. It'll just crash the, ahh, the Phone Application.'' right, sure, but SO WHAT?!
Re: [zfs-discuss] Data size grew.. with compression on
David Magda dma...@ee.ryerson.ca writes:
> On Apr 7, 2009, at 16:43, OpenSolaris Forums wrote:
>> if you have a snapshot of your files and rsync the same files again, you need to use the --inplace rsync option, otherwise completely new blocks will be allocated for the new files. that`s because rsync will write an entirely new file and rename it over the old one. not sure if this applies here, but i think it`s worth mentioning and not obvious.
>
> With ZFS new blocks will always be allocated: it's a copy-on-write (COW) file system.

So who is right here... Daniel Rock says he can see on disk that it doesn't work that way... that is, only a small amount of space is taken when rsyncing in this way. See his post:

From: Daniel Rock sola...@deadcafe.de
Subject: Re: Data size grew.. with compression on
Newsgroups: gmane.os.solaris.opensolaris.zfs
To: zfs-discuss@opensolaris.org
Date: Thu, 09 Apr 2009 16:35:07 +0200
Message-ID: 49de079b.2040...@deadcafe.de

[...]

Johnathon wrote:
> ZFS will allocate new blocks either way

Daniel R replied:
> No it won't. --inplace doesn't rewrite blocks identical on source and target but only blocks which have been changed. I use rsync to synchronize a directory with a few large files (each up to 32 GB). Data normally gets appended to one file until it reaches the size limit of 32 GB. Before I used --inplace a snapshot needed on average ~16 GB. Now with --inplace it is just a few kBytes.
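(A hedged illustration of the workflow being debated - paths and dataset names are made up:)

    # initial copy, then snapshot
    rsync -a --inplace /data/ /tank/backup/
    zfs snapshot tank/backup@2009-04-10

    # next run: --inplace rewrites only changed blocks, so the
    # snapshot's unique space stays proportional to the delta;
    # without --inplace, rsync writes a whole new temp file and
    # renames it, so every block of a changed file is newly allocated.
    rsync -a --inplace /data/ /tank/backup/
    zfs snapshot tank/backup@2009-04-17
    zfs list -t snapshot -o name,used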
[zfs-discuss] Can zfs snapshot nfs mounts
I'm sorry if this is either obvious or has been beaten to death... I'm looking for ways to back up data on a linux server that has been using rsync with the script `rsnapshot'. Some of you may know how that works... I won't explain it here other than to say only changed data gets rsynced to the backup data.

I'm wondering if I could do something similar by making that data directory hierarchy available on an OpenSolaris (build 110) zfs server as an NFS mount. That is, can I nfs mount a remote filesystem on a zpool and use zfs snapshot functionality to create snapshots of that data?

I'm not at all familiar with using zfs in general, but was thinking something like: Make a directory hierarchy on a remote linux machine available for nfs mount. Mount the nfs share on the osol server, inside a zpool. Do whatever is the correct way on zfs to create a snapshot of that mounted data and write it onto another directory also inside the zpool. A week later, mount the same nfs share from the remote linux machine and create a second snapshot written to the same other directory.

Will that procedure produce a snapshot of first the full base data, and the second time around only the changed data in comparison to the first snapshot? So proceeding in that manner, I'd have a series of snapshots where I could trace any differences in files?
Re: [zfs-discuss] Can zfs snapshot nfs mounts
On Fri, 10 Apr 2009 13:18:05 -0500, Harry Putnam rea...@newsguy.com wrote:

> I'm sorry if this is either obvious or has been beaten to death... [...] That is, can I nfs mount a remote filesystem on a zpool

Yes,

> and use zfs snapshot functionality to create snapshots of that data?

No.

> [...] Will that procedure produce a snapshot of first the full base data, and the second time around only the changed data in comparison to the first snapshot? So proceeding in that manner, I'd have a series of snapshots where I could trace any differences in files?

Won't work. The filesystem you mount is not a zfs filesystem. The mountpoint can be in a zfs in a zpool, but that doesn't make it a zfs.
--
( Kees Nuyt
) c[_]
Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great
On Fri, Apr 10 at 8:07, Patrick Skerrett wrote:
> Thanks for the explanation, folks. So if I cannot get Apache/WebDAV to write synchronously (and it does not look like I can), then is it possible to tune the ARC to be more write-buffer heavy? [...] If I were to add in the SSD L2ARC devices for read caching, can I configure the ARC to give up some of its read caching for more write buffering?

I think in most cases the raw spindle throughput should be enough to handle your load, or else you haven't sized your arrays properly. Bursts of async writes of relatively large size should be headed to the media at somewhere around 50-100 MB/s/vdev, I would think.

How much burst IO do you have?

--
Eric D. Mudama
edmud...@mail.bounceswoosh.org
Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great
More than that :) It's very, very short duration, but we have the potential for tens of thousands of clients doing writes all at the same time. I have the farm spread out over 16 servers, each with 2x 4Gb fiber cards into big disk arrays, but my reads do get slow (resulting in end-user experience degradation) when these write bursts come in, and if I could buffer them even for 60 seconds, it would make everything much smoother.

Is there a way to optimize the ARC for more write buffering, and push more read caching off into the L2ARC? Again, I'm only worried about short bursts that happen once or twice a day. The rest of the time everything runs very smooth. Thanks.

Eric D. Mudama wrote:
> I think in most cases the raw spindle throughput should be enough to handle your load, or else you haven't sized your arrays properly. Bursts of async writes of relatively large size should be headed to the media at somewhere around 50-100 MB/s/vdev, I would think. How much burst IO do you have?
Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great
On Fri, 10 Apr 2009, Patrick Skerrett wrote:
> ... degradation) when these write bursts come in, and if I could buffer them even for 60 seconds, it would make everything much smoother.

ZFS already batches up writes into a transaction group, which currently happens every 30 seconds. Have you tested zfs against a real-world workload?

Regards, markm
Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great
Yes, we are currently running ZFS, just without an L2ARC or an offloaded ZIL.

Mark J Musante wrote:
> ZFS already batches up writes into a transaction group, which currently happens every 30 seconds. Have you tested zfs against a real-world workload?
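(For what it's worth, the main ARC knob generally available in this timeframe is the overall size cap in /etc/system; a hedged sketch - the value is only an example, and note this limits read caching rather than directly adding write buffering:)

    * Cap the ZFS ARC at 24 GB (zfs_arc_max is in bytes; the value
    * here is illustrative), leaving headroom for bursty async
    * writes and the rest of the application.
    set zfs:zfs_arc_max = 0x600000000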
Re: [zfs-discuss] Can zfs snapshot nfs mounts
On Fri, Apr 10, 2009 at 01:18:05PM -0500, Harry Putnam wrote:
> I'm looking for ways to back up data on a linux server that has been using rsync with the script `rsnapshot'. [...] I'm wondering if I could do something similar by making that data directory hierarchy available on an OpenSolaris (build 110) zfs server as an NFS mount.

Not unless the data is within the ZFS pool.

> That is, can I nfs mount a remote filesystem on a zpool and use zfs snapshot functionality to create snapshots of that data?

No. You can't mount a filesystem into a zpool. The snapshots are easy for ZFS to do because it owns the data and it knows about every changed block. It wouldn't be able to see that on a remote NFS filesystem. The client has to scan every file (like rsync) to find changes.

So a ZFS host could own the data, then *it* could be the NFS server and it would work the way you want. But it won't work if the ZFS host is just an NFS client.

-- Darren
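(A hedged sketch of the arrangement Darren describes, with made-up pool, dataset, and host names - the ZFS box owns the data, the linux box writes into it over NFS, and snapshots are taken where the blocks live:)

    # on the OpenSolaris host: create the dataset and share it over NFS
    zfs create tank/linuxbackups
    zfs set sharenfs=on tank/linuxbackups

    # on the linux host: mount it and rsync into it
    mount -t nfs osolhost:/tank/linuxbackups /mnt/backup
    rsync -a --inplace /data/ /mnt/backup/

    # back on the OpenSolaris host: snapshots now work, because ZFS
    # owns every block the NFS clients wrote
    zfs snapshot tank/linuxbackups@weekly-2009-04-10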
Re: [zfs-discuss] Can zfs snapshot nfs mounts
>> [...] Mount the nfs share on the osol server, inside a zpool. Do whatever is the correct way on zfs to create a snapshot of that mounted data and write it onto another directory also inside the zpool. A week later, mount the same nfs share from the remote linux machine and create a second snapshot written to the same other directory. [...] So proceeding in that manner, I'd have a series of snapshots where I could trace any differences in files?
>
> Won't work. The filesystem you mount is not a zfs filesystem. The mountpoint can be in a zfs in a zpool, but that doesn't make it a zfs.

Gack... I was afraid of that... it sounded way too easy.

Thanks for the input... I guess I'll have to wait and see what is resolved in the other thread about what happens when you rsync data to a zfs filesystem and how that interplays with the zfs snapshots going on too.
Re: [zfs-discuss] Data size grew.. with compression on
On 10-Apr-09, at 2:03 PM, Harry Putnam wrote:
> David Magda dma...@ee.ryerson.ca writes:
>> On Apr 7, 2009, at 16:43, OpenSolaris Forums wrote:
>>> if you have a snapshot of your files and rsync the same files again, you need to use the --inplace rsync option, otherwise completely new blocks will be allocated for the new files. [...]
>> With ZFS new blocks will always be allocated: it's a copy-on-write (COW) file system.
> So who is right here...

As far as I can see, the effect of --inplace would be that new blocks are allocated for the deltas, not the whole file, so Daniel Rock's finding does not contradict OpenSolaris Forums. But in either case, COW is involved.

--Toby

> Daniel Rock says he can see on disk that it doesn't work that way... that is, only a small amount of space is taken when rsyncing in this way. See his post: [...]
>> Before I used --inplace a snapshot needed on average ~16 GB. Now with --inplace it is just a few kBytes.
Re: [zfs-discuss] vdev_disk_io_start() sending NULL pointer in ldi_ioctl()
Hi Mark,

Thanks for responding. In my case cdev_ioctl() is going through vxdmp:dmpioctl():

    pc:  0x134ccb8  vxdmp:dmpioctl+0x8:  stw %g0, [%i5]    (clr [%i5])
    npc: 0x134ccbc  vxdmp:dmpioctl+0xc:  or %g0, %i0, %o0  (mov %i0, %o0)

    trap: vxdmp:dmpioctl+0x8(, 0x422, 0x3000c3b7108, 0x8020, 0x60032c03df0, 0x0)
    genunix:ldi_ioctl(0x6004c8cce78, 0x422, 0x3000c3b7108, 0x8000, 0x60032c03df0, 0x0) - frame recycled
    zfs:vdev_disk_io_start+0xc8()
    zfs:zio_vdev_io_start(0x3000c3b6eb8) - frame recycled

As we can see, %i5 (the rvalp passed down from ldi_ioctl()) is NULL, so we panic here.

--shyamali

On 04/10/09 06:26, Mark J Musante wrote:
> On Thu, 9 Apr 2009, shyamali.chakrava...@sun.com wrote:
>> Hi All, I have a corefile where we see a NULL pointer dereference PANIC, as we have (deliberately) sent a NULL pointer for the return value:
>>
>>     vdev_disk_io_start()
>>         error = ldi_ioctl(dvd->vd_lh, zio->io_cmd,
>>             (uintptr_t)zio->io_dk_callback, FKIOCTL, kcred, NULL);
>
> Note that it's not just in vdev_disk_io_start() that we pass NULL. It's everywhere - there are four calls to ldi_ioctl() in vdev_disk.c, and they all pass NULL.
>
>> ldi_ioctl() expects the last parameter to be an integer pointer (int *rvalp). I see that in strdoioctl().
>
> I'm curious about your configuration. What is the setup you've got that is going through stream I/O? Regards, markm
Re: [zfs-discuss] Can zfs snapshot nfs mounts
On Sat, Apr 11, 2009 at 4:41 AM, Harry Putnam rea...@newsguy.com wrote:
> Thanks for the input... I guess I'll have to wait and see what is resolved in the other thread about what happens when you rsync data to a zfs filesystem and how that interplays with the zfs snapshots going on too.

Is there anything to wait for? Didn't that thread already mention that using rsync with --inplace and zfs snapshots works as expected? Or do you have other problems?
Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great
On 10-Apr-09, at 5:05 PM, Mark J Musante wrote:
> ZFS already batches up writes into a transaction group, which currently happens every 30 seconds.

Isn't that 5 seconds?

--T
Re: [zfs-discuss] ZIL SSD performance testing... -IOzone works great, others not so great
On 04/10/09 20:15, Toby Thain wrote:
> On 10-Apr-09, at 5:05 PM, Mark J Musante wrote:
>> ZFS already batches up writes into a transaction group, which currently happens every 30 seconds.
> Isn't that 5 seconds?

It used to be, and it may still be for what you are running. However, Mark is right: it is now 30 seconds. In fact, 30s is the maximum. The actual time will depend on load. If the pool is heavily used, then the txgs fire more frequently.

Neil.
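(For the curious, the interval Neil mentions is exposed as a kernel tunable in the OpenSolaris source of this era; a hedged /etc/system sketch - the tunable's name and default vary by build, so treat this as an assumption to verify against your own release before relying on it:)

    * Maximum seconds between transaction group commits (the 5s-vs-30s
    * behaviour discussed above). Illustrative only; confirm the
    * tunable exists on your build before setting it.
    set zfs:zfs_txg_timeout = 5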