Re: [zfs-discuss] Re: Heavy writes freezing system
Jason J. W. Williams writes:
> Hi Anantha, I was curious why segregating at the FS level would provide
> adequate I/O isolation? Since all FS are on the same pool, I assumed
> flogging a FS would flog the pool and negatively affect all the other
> FS on that pool? Best Regards, Jason

Good point. If the problem is

  6413510 zfs: writing to ZFS filesystem slows down fsync() on other files

then segregating into two filesystems on the same pool will help. But if the problem is more like

  6429205 each zpool needs to monitor its throughput and throttle heavy writers

then two filesystems won't help. Two pools probably would, though.

-r
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Heavy writes freezing system
> Bag-o-tricks-r-us, I suggest the following in such a case:
> - Two ZFS pools
>   - One for production
>   - One for Education

The DBAs are very resistant to splitting up our environments. There are nine on the test/devl server! So, we're going to put the DB files and redo logs on separate (UFS with directio) LUNs. Binaries and backups will go onto two separate ZFS LUNs. With production, they can do their cloning at night to minimize the impact. Not sure what they'll do on test/devl. The two ZFS filesystems will probably also be separate zpools (for political as well as Hitachi disk-space-juggling reasons).

BTW, it wasn't the storage guys who decided on the "one filesystem to rule them all" strategy, but my predecessors. It was part of the move from CLARiiON arrays to Hitachi. The storage folks know about, understand, and agree with us when we talk about these kinds of issues (at least, they do now). We've pushed the caching and other subsystems often enough to make this painfully clear.

> Another thought is while ZFS works out its kinks why not use the BCV or
> ShadowCopy or whatever IBM calls it to create the Education instance.
> This will reduce a tremendous amount of I/O.

This means buying more software to alleviate a short-term problem (with RAC, the whole design will be different, including moving to ASM). We have RMAN and OEM already, so this argument won't fly.

> BTW, I'm curious what application using Oracle is creating more than a
> million files?

Oracle Financials. The application includes everything but the kitchen sink (but the bathroom sink is there!).

Thanks for all of your feedback and suggestions. They all sound bang on. If we could just get all the pieces in place to move forward now, I think we'll be OK. One big issue for us will be finding the Hitachi disk space--we're pretty full up right now. :-(

Rainer

This message posted from opensolaris.org
Re: [zfs-discuss] Re: Heavy writes freezing system
Hello Anantha,

Wednesday, January 17, 2007, 2:35:01 PM, you wrote:

ANS> You're probably hitting the same wall/bug that I came across; ZFS in
ANS> all versions up to and including Sol10U3 generates excessive I/O when
ANS> it encounters fsync() or if any of the files were opened with the
ANS> O_DSYNC option.

ANS> I do believe Oracle (or any DB, for that matter) opens its files with
ANS> the O_DSYNC option. During normal times this does result in excessive
ANS> I/O, but it is probably well under your system's capacity (it was in
ANS> our case). But when you are doing backups or clones (Oracle clones by
ANS> using RMAN or copying of DB files?) you are going to flood the I/O
ANS> subsystem, and that's when the whole ZFS excessive-I/O behavior starts
ANS> to put a hurt on DB performance.

ANS> Here are a few suggestions that can give you interim relief:
ANS> - Segregate your I/O at the filesystem level; the bug is at the
ANS>   filesystem level, not the ZFS pool level. By this I mean ensure the
ANS>   online redo logs are in a ZFS FS that nobody else uses, and the same
ANS>   for the control files. As long as the writes to the control files
ANS>   and online redo logs are met, your system will be happy.
ANS> - Ensure that your clone and RMAN (if you're going to disk) write to
ANS>   a separate ZFS FS that contains no production files.
ANS> - If the above two items don't give you relief, then relocate the
ANS>   online redo logs and control files to a UFS filesystem. No need to
ANS>   downgrade the entire ZFS deployment to something else.
ANS> - Consider Oracle ASM (DB version permitting); it works very well.
ANS>   Why deal with VxFS?

ANS> Feel free to drop me a line; I've over 17 years of Oracle DB
ANS> experience and love to troubleshoot problems like this. I've another
ANS> vested interest; we're considering ZFS for widespread use in our
ANS> environment, and any experience is good for us.

Also, as a workaround, you could disable the ZIL if that's acceptable to you (in the case of a system panic or hard reset you can end up with an unrecoverable database).
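For reference, a sketch of the ZIL-disable workaround as it applied to Solaris 10 of this era (it was a kernel tunable; the per-dataset sync property came later). This is illustrative only, not a recommendation: with the ZIL off, a panic or power loss can leave the database unrecoverable.

```shell
# Persistent form of the ZIL-disable workaround: add this line to
# /etc/system and reboot (Solaris 10-era tunable, since removed):
line='set zfs:zil_disable = 1'
echo "$line"
# Live form, via the kernel debugger; takes effect for filesystems
# mounted after the change:
#   echo 'zil_disable/W 1' | mdb -kw
```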
--
Best regards,
Robert                          mailto:[EMAIL PROTECTED]
                                http://milek.blogspot.com
[zfs-discuss] Re: Heavy writes freezing system
> What do you mean by UFS wasn't an option due to number of files?

Exactly that. UFS has a 1 million file limit under Solaris. Each Oracle Financials environment well exceeds this limitation.

> Also do you have any tunables in system? Can you send 'zpool status'
> output? (raidz, mirror, ...?)

Our tunables are:

  set noexec_user_stack=1
  set sd:sd_max_throttle = 32
  set sd:sd_io_time = 0x3c

zpool status:

  pool: d
 state: ONLINE
 scrub: none requested
config:

        NAME                             STATE     READ WRITE CKSUM
        d                                ONLINE       0     0     0
          c5t60060E800475AA0075AA100Bd0  ONLINE       0     0     0
          c5t60060E800475AA0075AA100Dd0  ONLINE       0     0     0
          c5t60060E800475AA0075AA100Cd0  ONLINE       0     0     0
          c5t60060E800475AA0075AA100Ed0  ONLINE       0     0     0

errors: No known data errors

> When the DBAs do clones - you mean that by just doing 'zfs clone ...'
> you get a big performance problem? Or maybe just before, when you do
> 'zfs snapshot' first? How much free space is left in the pool?

Nope. The DBA group clones the production instance using OEM in order to build copies for Education, development, etc. This is strictly an Oracle function, not a filesystem (ZFS) operation.

> Do you have sar data from when the problems occurred? Any paging in the
> system?

Some. I'll have to have the other analyst try to pull out the times when our testing was done, but I've been told nothing stood out. (I love playing middle-man. NOT!)

> And one piece of advice - before any more testing I would definitely
> upgrade/reinstall the system to U3 when it comes to ZFS.

Not an option. This isn't even a faint possibility. We're talking both our test/development servers and our production/education servers. That's six servers to upgrade (remember, we have the applications on servers distinct from the database servers--the DBAs would never let us diverge the OS releases).

Rainer
[zfs-discuss] Re: Heavy writes freezing system
Thanks for the feedback! This does sound like what we're hitting. From our testing, you are absolutely correct--separating out the parts is a major help.

The big problem we still see, though, is doing the clones/recoveries. The DBA group clones the production environment for Education. Since both of these instances live on the same server and ZFS pool/filesystem, this kills the throughput. When doing cloning or backups to a different area, whether UFS or ZFS, we don't have the issues.

I'll know for sure later today or tomorrow, but it sounds like they are seriously considering the ASM route. Since we will be going to RAC later this year, this move makes the most sense. We'll just have to hope that the DBA group gets a better understanding of LUNs and our SAN, as they'll be taking over part of the disk (LUN) management. :-/

We were hoping we could get some interim relief on the ZFS front through tuning or something, but if what you're saying is correct (and it sounds like it is), we may be out of luck.

Thanks very much for the feedback.

Rainer
Re: [zfs-discuss] Re: Heavy writes freezing system
Rainer Heilke wrote:
> I'll know for sure later today or tomorrow, but it sounds like they are
> seriously considering the ASM route. Since we will be going to RAC later
> this year, this move makes the most sense. We'll just have to hope that
> the DBA group gets a better understanding of LUNs and our SAN, as
> they'll be taking over part of the disk (LUN) management. :-/
> We were hoping we could get some interim relief on the ZFS front
> through tuning or something, but if what you're saying is correct (and
> it sounds like it is), we may be out of luck.

If you plan on RAC, then ASM makes good sense. It is unclear (to me, anyway) whether ASM over a zvol is better than ASM over a raw LUN. It would be nice to have some of the ZFS features, such as snapshots, without having to go through extraordinary pain or buy expensive RAID arrays. If someone has tried ASM on a zvol, please speak up :-)

-- richard
Re: [zfs-discuss] Re: Heavy writes freezing system
Rainer Heilke wrote:
>> What do you mean by UFS wasn't an option due to number of files?
>
> Exactly that. UFS has a 1 million file limit under Solaris. Each Oracle
> Financials environment well exceeds this limitation.

Really?!? I thought Oracle would use a database for storage...

>> Also do you have any tunables in system? Can you send 'zpool status'
>> output? (raidz, mirror, ...?)
>
> Our tunables are:
>   set noexec_user_stack=1
>   set sd:sd_max_throttle = 32
>   set sd:sd_io_time = 0x3c

EMC?

> zpool status:
>
>   pool: d
>  state: ONLINE
>  scrub: none requested
> config:
>
>         NAME                             STATE     READ WRITE CKSUM
>         d                                ONLINE       0     0     0
>           c5t60060E800475AA0075AA100Bd0  ONLINE       0     0     0
>           c5t60060E800475AA0075AA100Dd0  ONLINE       0     0     0
>           c5t60060E800475AA0075AA100Cd0  ONLINE       0     0     0
>           c5t60060E800475AA0075AA100Ed0  ONLINE       0     0     0
>
> errors: No known data errors
>
>> When the DBAs do clones - you mean that by just doing 'zfs clone ...'
>> you get a big performance problem? Or maybe just before, when you do
>> 'zfs snapshot' first? How much free space is left in the pool?
>
> Nope. The DBA group clones the production instance using OEM in order
> to build copies for Education, development, etc. This is strictly an
> Oracle function, not a filesystem (ZFS) operation.
>
>> Do you have sar data from when the problems occurred? Any paging in
>> the system?
>
> Some. I'll have to have the other analyst try to pull out the times
> when our testing was done, but I've been told nothing stood out. (I
> love playing middle-man. NOT!)
>
>> And one piece of advice - before any more testing I would definitely
>> upgrade/reinstall the system to U3 when it comes to ZFS.
>
> Not an option. This isn't even a faint possibility. We're talking both
> our test/development servers and our production/education servers.
> That's six servers to upgrade (remember, we have the applications on
> servers distinct from the database servers--the DBAs would never let
> us diverge the OS releases).

Yes, this is common, so you should look for the patches, which should fix at least the fsync problem. Check the archives here for patch update info from George Wilson.
-- richard
Re: [zfs-discuss] Re: Heavy writes freezing system
>> What do you mean by UFS wasn't an option due to number of files?
>
> Exactly that. UFS has a 1 million file limit under Solaris. Each Oracle
> Financials environment well exceeds this limitation.

what ?

$ uname -a
SunOS core 5.10 Generic_118833-17 sun4u sparc SUNW,UltraSPARC-IIi-cEngine
$ df -F ufs -t
/            (/dev/md/dsk/d0 ):   5367776 blocks    616328 files
    total:                       13145340 blocks    792064 files
/export/nfs  (/dev/md/dsk/d8 ):  83981368 blocks  96621651 files
    total:                      404209452 blocks 100534720 files
/export/home (/dev/md/dsk/d7 ):    980894 blocks    260691 files
    total:                         986496 blocks    260736 files
$

I think that I am 95,621,651 files over your 1 million limit right there! Should I place a support call and file a bug report?

Dennis
Re: [zfs-discuss] Re: Heavy writes freezing system
Dennis Clarke wrote:
>>> What do you mean by UFS wasn't an option due to number of files?
>>
>> Exactly that. UFS has a 1 million file limit under Solaris. Each
>> Oracle Financials environment well exceeds this limitation.
>
> what ?
>
> $ uname -a
> SunOS core 5.10 Generic_118833-17 sun4u sparc SUNW,UltraSPARC-IIi-cEngine
> $ df -F ufs -t
> /            (/dev/md/dsk/d0 ):   5367776 blocks    616328 files
>     total:                       13145340 blocks    792064 files
> /export/nfs  (/dev/md/dsk/d8 ):  83981368 blocks  96621651 files
>     total:                      404209452 blocks 100534720 files
> /export/home (/dev/md/dsk/d7 ):    980894 blocks    260691 files
>     total:                         986496 blocks    260736 files
> $
>
> I think that I am 95,621,651 files over your 1 million limit right
> there!

is that a multi-terabyte UFS? if no, ignore :-); if yes, the actual limit is 1 million inodes PER terabyte.

HTH
--
Michael Schuster
Sun Microsystems, Inc.
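Michael's per-terabyte figure can be checked with back-of-the-envelope arithmetic: multi-terabyte UFS defaults to an inode density of one inode per megabyte of space (nbpi = 1048576 bytes per inode), which is where the roughly one-million-inodes-per-terabyte ceiling comes from.

```shell
# Rough sketch: inodes per terabyte at the default multi-terabyte UFS
# inode density (nbpi = 1 MB per inode).
nbpi=1048576
bytes_per_tb=$((1024 * 1024 * 1024 * 1024))
inodes_per_tb=$((bytes_per_tb / nbpi))
echo "${inodes_per_tb} inodes per TB"   # 1048576, i.e. ~1 million
```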
Re: [zfs-discuss] Re: Heavy writes freezing system
Hi Anantha,

I was curious why segregating at the FS level would provide adequate I/O isolation? Since all FS are on the same pool, I assumed flogging a FS would flog the pool and negatively affect all the other FS on that pool?

Best Regards,
Jason

On 1/17/07, Anantha N. Srirama [EMAIL PROTECTED] wrote:
> You're probably hitting the same wall/bug that I came across; ZFS in
> all versions up to and including Sol10U3 generates excessive I/O when
> it encounters fsync() or if any of the files were opened with the
> O_DSYNC option.
>
> I do believe Oracle (or any DB, for that matter) opens its files with
> the O_DSYNC option. During normal times this does result in excessive
> I/O, but it is probably well under your system's capacity (it was in
> our case). But when you are doing backups or clones (Oracle clones by
> using RMAN or copying of DB files?) you are going to flood the I/O
> subsystem, and that's when the whole ZFS excessive-I/O behavior starts
> to put a hurt on DB performance.
>
> Here are a few suggestions that can give you interim relief:
> - Segregate your I/O at the filesystem level; the bug is at the
>   filesystem level, not the ZFS pool level. By this I mean ensure the
>   online redo logs are in a ZFS FS that nobody else uses, and the same
>   for the control files. As long as the writes to the control files
>   and online redo logs are met, your system will be happy.
> - Ensure that your clone and RMAN (if you're going to disk) write to a
>   separate ZFS FS that contains no production files.
> - If the above two items don't give you relief, then relocate the
>   online redo logs and control files to a UFS filesystem. No need to
>   downgrade the entire ZFS deployment to something else.
> - Consider Oracle ASM (DB version permitting); it works very well. Why
>   deal with VxFS?
>
> Feel free to drop me a line; I've over 17 years of Oracle DB experience
> and love to troubleshoot problems like this. I've another vested
> interest; we're considering ZFS for widespread use in our environment,
> and any experience is good for us.
[zfs-discuss] Re: Heavy writes freezing system
Bag-o-tricks-r-us, I suggest the following in such a case:

- Two ZFS pools:
  - One for production
  - One for Education
- Isolate the LUNs feeding the pools if possible; don't share spindles.
  Remember, on EMC/Hitachi you have logical LUNs created by
  striping/concatenating carved-up physical disks, so you could have two
  LUNs that share the same spindle. Don't believe one word from your
  storage admin about "we've got lots of cache to abstract the physical
  structure"; Oracle can push any storage subsystem over the edge.
  Almost all of the storage vendors prevent one LUN from flooding the
  cache with writes; EMC gives no more than 8x the initial allocation of
  cache (total cache/total disk space), and after that it'll stall your
  writes until destaging is complete.
- At least two ZFS filesystems under the production pool:
  - One for online redo logs and control files. If need be, you can
    further segregate them onto two separate ZFS filesystems.
  - One for DB files. If need be, you can isolate further by data,
    index, temp, archived redo, ...
- Don't host the 'temp' on ZFS; just feed it plain old UFS or raw disk.
- Match up your ZFS recordsize with your DB blocksize * multi-block read
  count. Don't do this for the index filesystem, just the filesystem
  hosting data.

Rinse and repeat for your Education ZFS pool. This will give you substantial isolation and improvement, sufficient to buy you time to plan out a better deployment strategy, given that you're under the gun now.

Another thought: while ZFS works out its kinks, why not use BCV or ShadowCopy or whatever IBM calls it to create the Education instance? This will reduce a tremendous amount of I/O.

Just this past weekend I re-did our SAS server to relocate *just* the SAS work area to good ol' UFS, and the payback is tremendous; not one complaint about performance 3 days in a row (we used to hear daily complaints.)
By taking care of your online redo logs and control files (maybe skipping ZFS for them altogether and running them on UFS) you'll breathe easier.

BTW, I'm curious what application using Oracle is creating more than a million files?
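The recordsize rule of thumb above can be made concrete. A minimal sketch, assuming an 8K db_block_size and a db_file_multiblock_read_count of 16 (both values, and the pool/filesystem names in the comments, are illustrative, not from the thread):

```shell
# Compute the full-table-scan I/O size that the data filesystem's
# ZFS recordsize should match (assumed example values).
db_block_size=8192             # Oracle db_block_size, bytes
multiblock_read_count=16       # db_file_multiblock_read_count
io_bytes=$((db_block_size * multiblock_read_count))
echo "full multiblock read = ${io_bytes} bytes ($((io_bytes / 1024))K)"
# Then, on the data filesystem only (not index; names hypothetical):
#   zfs set recordsize=128k prodpool/data
#   zfs get recordsize prodpool/data
```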
Re[2]: [zfs-discuss] Re: Heavy writes freezing system
Hello Jason,

Wednesday, January 17, 2007, 11:24:50 PM, you wrote:

JJWW> Hi Anantha,
JJWW> I was curious why segregating at the FS level would provide adequate
JJWW> I/O isolation? Since all FS are on the same pool, I assumed flogging
JJWW> a FS would flog the pool and negatively affect all the other FS on
JJWW> that pool?

Because of the bug, one fsync() on one file forces all outstanding writes in that filesystem to commit to storage. When you separate the data into different filesystems, the bug affects only the data in the filesystem being fsync'ed, which can greatly reduce the impact on performance if it's done right.

--
Best regards,
Robert                          mailto:[EMAIL PROTECTED]
                                http://milek.blogspot.com
Re: Re[2]: [zfs-discuss] Re: Heavy writes freezing system
Hi Robert,

I see. So it really doesn't get around the idea of putting DB files and logs on separate spindles?

Best Regards,
Jason

On 1/17/07, Robert Milkowski [EMAIL PROTECTED] wrote:
> Hello Jason,
>
> Wednesday, January 17, 2007, 11:24:50 PM, you wrote:
>
> JJWW> Hi Anantha,
> JJWW> I was curious why segregating at the FS level would provide
> JJWW> adequate I/O isolation? Since all FS are on the same pool, I
> JJWW> assumed flogging a FS would flog the pool and negatively affect
> JJWW> all the other FS on that pool?
>
> Because of the bug, one fsync() on one file forces all outstanding
> writes in that filesystem to commit to storage. When you separate the
> data into different filesystems, the bug affects only the data in the
> filesystem being fsync'ed, which can greatly reduce the impact on
> performance if it's done right.
>
> --
> Best regards,
> Robert                          mailto:[EMAIL PROTECTED]
>                                 http://milek.blogspot.com
[zfs-discuss] Re: Heavy writes freezing system
> What hardware is used? Sparc? x86 32-bit? x86 64-bit? How much RAM is
> installed? Which version of the OS?

Sorry, this is happening on two systems (test and production). They're both Solaris 10, Update 2. Test is a V880 with 8 CPUs and 32GB; production is an E2900 with 12 dual-core CPUs and 48GB.

> Did you already try to monitor kernel memory usage while writing to
> ZFS? Maybe the kernel is running out of free memory? (I have bugs like
> 6483887 in mind; without direct management, arc ghost lists can run
> amok.)

We haven't seen serious kernel memory usage that I know of (I'll be honest--I came into this problem late).

> For a live system:
>   echo ::kmastat | mdb -k
>   echo ::memstat | mdb -k

I can try this if the DBA group is willing to do another test, thanks.

> In case you've got a crash dump for the hung system, you can try the
> same ::kmastat and ::memstat commands using the kernel crash dumps
> saved in directory /var/crash/`hostname`:
>   # cd /var/crash/`hostname`
>   # mdb -k unix.1 vmcore.1
>   > ::memstat
>   > ::kmastat

The system doesn't actually crash. It also doesn't freeze _completely_. While I call it a freeze (best name for it), it actually just slows down incredibly. It's like the whole system bogs down like molasses in January. Things happen, but very slowly.

Rainer