Re: [zfs-discuss] [storage-discuss] dos programs on a

2008-02-06 Thread Maurilio Longo
Alan,

I'm using Nexenta Core RC4, which is based on Nevada 81/82.

zfs casesensitivity is set to 'insensitive'

Best regards.

Maurilio.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] available space?

2008-02-06 Thread Jure Pečar

Maybe a basic zfs question ...

I have a pool:

# zpool status backup
  pool: backup
 state: ONLINE
 scrub: none requested
config:

NAME  STATE READ WRITE CKSUM
backupONLINE   0 0 0
  mirror  ONLINE   0 0 0
c1t0d0s1  ONLINE   0 0 0
c2t0d0s1  ONLINE   0 0 0
  raidz2  ONLINE   0 0 0
c1t1d0ONLINE   0 0 0
c1t2d0ONLINE   0 0 0
c1t3d0ONLINE   0 0 0
c1t4d0ONLINE   0 0 0
c1t5d0ONLINE   0 0 0
c1t6d0ONLINE   0 0 0
c1t7d0ONLINE   0 0 0
c2t1d0ONLINE   0 0 0
c2t2d0ONLINE   0 0 0
c2t3d0ONLINE   0 0 0
c2t4d0ONLINE   0 0 0
c2t5d0ONLINE   0 0 0
c2t6d0ONLINE   0 0 0
c2t7d0ONLINE   0 0 0

For which zpool list reports:

# zpool list backup
NAME      SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
backup   13.5T   434K  13.5T    0%  ONLINE  -


Yet df and zfs list show something else:

[EMAIL PROTECTED]:~# zfs list
NAME USED  AVAIL  REFER  MOUNTPOINT
backup   262K  11.5T  1.78K  none
backup/files32.0K  11.5T  32.0K  /export/files
...

# df -h
Filesystem             size   used  avail capacity  Mounted on
...
backup/files            11T    32K    11T     1%    /export/files


Why does AVAIL differ by such a large amount?

(NexentaOS_20080131)

-- 

Jure Pečar
http://jure.pecar.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] status of zfs boot netinstall kit

2008-02-06 Thread Roman Morokutti
Hi,

I would like to continue this (maybe a bit outdated) thread with two
questions:

   1. How do I create a netinstall image?
   2. How do I write the netinstall image back as an ISO 9660 image to DVD
      (after patching it for zfs boot)?

Roman
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zpool status -x strangeness on b78

2008-02-06 Thread Ben Miller
We run a cron job that does a 'zpool status -x' to check for any degraded 
pools.  We just happened to find a pool degraded this morning by running 'zpool 
status' by hand and were surprised that it was degraded as we didn't get a 
notice from the cron job.

# uname -srvp
SunOS 5.11 snv_78 i386

# zpool status -x
all pools are healthy

# zpool status pool1
  pool: pool1
 state: DEGRADED
 scrub: none requested
config:

NAME STATE READ WRITE CKSUM
pool1DEGRADED 0 0 0
  raidz1 DEGRADED 0 0 0
c1t8d0   REMOVED  0 0 0
c1t9d0   ONLINE   0 0 0
c1t10d0  ONLINE   0 0 0
c1t11d0  ONLINE   0 0 0

errors: No known data errors

I'm now going to look into why the disk is listed as removed.

Does this look like a bug with 'zpool status -x'?
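Until that's understood, a cruder cron check that doesn't rely on -x would
be to look at the health column directly; a sketch (assuming 'health' is
accepted by 'zpool list -o' on this build, which its default output suggests):

# zpool list -H -o name,health | grep -v ONLINE

which prints only pools whose health is something other than ONLINE.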

Ben
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] mounting a copy of a zfs pool /file system while orginal is still active

2008-02-06 Thread eric kustarz

 While browsing the ZFS source code, I noticed that usr/src/cmd/ 
 ztest/ztest.c, includes ztest_spa_rename(), a ZFS test which  
 renames a ZFS storage pool to a different name, tests the pool  
 under its new name, and then renames it back. I wonder why this  
 functionality was not exposed as part of zpool support?


See 6280547 "want to rename pools".

It just hasn't been high on the priority list.
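In the meantime, the usual workaround is to rename a pool at import time,
assuming it can be briefly exported (names here are just placeholders):

# zpool export tank
# zpool import tank newtank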

eric
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS number of file systems scalability

2008-02-06 Thread Shawn Ferry

There is a write-up of similar findings and more information about
sharemgr at:
http://developers.sun.com/solaris/articles/nfs_zfs.html

Unfortunately I don't see anything that says those changes will be in  
u5.

Shawn

On Feb 5, 2008, at 8:21 PM, Paul B. Henson wrote:


 I was curious to see about how many filesystems one server could
 practically serve via NFS, and did a little empirical testing.

 Using an x4100M2 server running S10U4x86, I created a pool from a  
 slice of
 the hardware raid array created from the two internal hard disks,  
 and set
 sharenfs=on for the pool.

 I then created filesystems, 1000 at a time, and timed how long it  
 took to
 create each thousand filesystems, to set sharenfs=off for all  
 filesystems
 created so far, and to set sharenfs=on again for all filesystems. I
 understand sharetab optimization is one of the features in the latest
 OpenSolaris, so just for fun I tried symlinking /etc/dfs/sharetab to  
 a mfs
 file system to see if it made any difference. I also timed a  
 complete boot
 cycle (from typing 'init 6' until the server was again remotely  
 available)
 at 5000 and 10,000 filesystems.

 Interestingly, filesystem creation itself scaled reasonably well. I
 recently read a thread where someone was complaining it took over  
 eight
 minutes to create a filesystem at the 10,000 filesystem count. In my  
 tests,
 while the first 1000 filesystems averaged only a little more than  
 half a
 second each to create, filesystems 9000-10000 only took roughly
 twice that,
 averaging about 1.2 seconds each to create.

 Unsharing scalability wasn't as good, time requirements increasing  
 by a
 factor of six. Having sharetab in mfs made a slight difference, but  
 nothing
 outstanding. Sharing (unsurprisingly) was the least scalable,  
 increasing by
 a factor of eight.

 Boot-wise, the system took about 10.5 minutes to reboot at 5000
 filesystems. This increased to about 35 minutes at the 10,000 file  
 system
 counts.

 Based on these numbers, I don't think I'd want to run more than 5-7
 thousand filesystems per server to avoid extended outages. Given our  
 user
 count, that will probably be 6-10 servers 8-/. I suppose we could  
 have a
 large number of smaller servers rather than a small number of beefier
 servers; although that seems less than efficient. It's too bad  
 there's no
 way to fast track backporting of openSolaris improvements to  
 production
 Solaris, from what I've heard there will be virtually no ZFS  
 improvements
 in S10U5 :(.

 Here are the raw numbers for anyone interested. The first column is  
 number
 of file systems. The second column is total and average time in  
 seconds to
 create that block of filesystems (eg, the first 1000 took 589  
 seconds to
 create, the second 1000 took 709 seconds). The third column is the  
 time in
 seconds to turn off NFS sharing for all filesystems created so far  
 (eg, 14
 seconds for 1000 filesystems, 38 seconds for 2000 filesystems). The  
 fourth
 is the same operation with sharetab in a memory filesystem (I  
 stopped this
 measurement after 7000 because sharing was starting to take so  
 long). The
 final column is how long it took to turn on NFS sharing for all  
 filesystems
 created so far.


  #FS     create/avg    off/avg   off(mfs)/avg    on/avg
  1000      589/.59       14/.01       9/.01        32/.03
  2000      709/.71       38/.02      25/.01       107/.05
  3000      783/.78       70/.02      50/.02       226/.08
  4000      836/.84      112/.03      83/.02       388/.10
  5000      968/.97      178/.04     124/.02       590/.12
  6000      930/.93      245/.04     172/.03       861/.14
  7000      961/.96      319/.05     229/.03      1172/.17
  8000     1045/1.05     405/.05        -         1515/.19
  9000     1098/1.10     500/.06        -         1902/.21
 10000     1165/1.17     599/.06        -         2348/.23
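 For reference, the measurements above came from loops roughly like the
 following; this is a sketch rather than the exact script, with 'tank'
 standing in for the real pool name:

 # ptime sh -c 'i=1; while [ $i -le 1000 ]; do zfs create tank/fs$i; i=`expr $i + 1`; done'
 # ptime zfs set sharenfs=off tank
 # ptime zfs set sharenfs=on tank

 Since sharenfs set on the pool dataset is inherited by its children,
 toggling it at the top level (un)shares every filesystem below it.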


 -- 
 Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
 Operating Systems and Network Analyst  |  [EMAIL PROTECTED]
 California State Polytechnic University  |  Pomona CA 91768
 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

--
Shawn Ferry  shawn.ferry at sun.com
Senior Primary Systems Engineer
Sun Managed Operations
571.291.4898





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Did MDB Functionality Change?

2008-02-06 Thread spencer
On Solaris 10 u3 (11/06) I can execute the following:

bash-3.00# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs sd pcipsy ip sctp usba 
nca md zfs random ipc nfs crypto cpc fctl fcip logindmux ptm sppp ]
> arc::print
{
anon = ARC_anon
mru = ARC_mru
mru_ghost = ARC_mru_ghost
mfu = ARC_mfu
mfu_ghost = ARC_mfu_ghost
size = 0x6b800
p = 0x3f83f80
c = 0x7f07f00
c_min = 0x7f07f00
c_max = 0xbe8be800
hits = 0x30291
misses = 0x4f
deleted = 0xe
skipped = 0
hash_elements = 0x3a
hash_elements_max = 0x3a
hash_collisions = 0x3
hash_chains = 0x1
hash_chain_max = 0x1
no_grow = 0
}

However, when I execute the same command on Solaris 10 u4 (8/07) I receive the 
following error:

bash-3.00# mdb -k
Loading modules: [ unix krtld genunix specfs dtrace ufs ssd fcp fctl qlc pcisch 
md ip hook neti sctp arp usba nca lofs logindmux ptm cpc fcip sppp random sd 
crypto zfs ipc nfs ]
> arc::print
mdb: failed to dereference symbol: unknown symbol name

In addition, u3 doesn't recognize ::arc, while u4 does.  u3 displays memory
locations with 'arc::print -a', but '::arc -a' doesn't work on u4.

I posted this to the zfs discussion forum because this missing u4
functionality prevents you from dynamically changing the ARC size by
following the ZFS tuning instructions.
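If the reason is that the ARC counters were moved into a kstat on u4
(that's my assumption, not something I've verified against the 8/07
source), the same numbers may still be reachable another way, e.g.:

# kstat -p zfs:0:arcstats
# mdb -k
> arc_stats::print

instead of the old arc::print.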


Spencer
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-06 Thread William Fretts-Saxton
I disabled file prefetch and there was no effect.

Here are some performance numbers.  Note that, when the application server used 
a ZFS file system to save its data, the transaction took TWICE as long.  For 
some reason, though, iostat is showing 5x as much disk writing (to the physical 
disks) on the ZFS partition.  Can anyone see a problem here?

-
Average application server client response time (1st run/2nd run):

SVM - 12/18 seconds
ZFS - 35/38 seconds

SVM Performance
---
# iostat -xnz 5
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  195.1  414.3 1465.9 1657.3  0.0  1.7    0.0    2.7   0  98 md/d100
   97.5  414.3  730.2 1657.3  0.0  1.0    0.0    1.9   0  74 md/d101
   97.7  414.1  735.8 1656.5  0.0  0.8    0.0    1.5   0  59 md/d102
   54.4  203.6  370.7  814.2  0.0  0.5    0.0    2.1   0  42 c0t2d0
   52.8  210.6  359.5  842.2  0.0  0.5    0.0    1.9   0  40 c0t3d0
   54.0  203.6  374.7  814.2  0.0  0.3    0.0    1.2   0  26 c0t4d0
   52.2  210.6  361.1  842.2  0.0  0.5    0.0    1.8   0  38 c0t5d0

ZFS Performance
---
# iostat -xnz 5
extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
   23.2  148.8 1496.7 3806.8  0.0  2.5    0.0   14.7   0  21 c0t2d0
   22.8  148.8 1470.9 3806.8  0.0  2.4    0.0   13.9   0  22 c0t3d0
   24.2  149.0 1561.1 3805.0  0.0  1.5    0.0    8.6   0  18 c0t4d0
   23.4  149.4 1509.6 3805.0  0.0  2.5    0.0   14.7   0  25 c0t5d0

# zpool iostat 5
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
-----------  ----  -----  -----  -----  -----  -----
pool1       5.69G   266G     12    243   775K  7.20M
pool1       5.69G   266G     88    232  5.53M  7.12M
pool1       5.69G   266G     78    216  4.87M  6.81M
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS configuration for a thumper

2008-02-06 Thread eric kustarz

On Feb 4, 2008, at 5:10 PM, Marion Hakanson wrote:

 [EMAIL PROTECTED] said:
 FYI, you can use the '-c' option to compare results from various  
 runs   and
 have one single report to look at.

 That's a handy feature.  I've added a couple of such comparisons:
   http://acc.ohsu.edu/~hakansom/thumper_bench.html

 Marion



Your findings for random reads with or without NCQ match my findings:
http://blogs.sun.com/erickustarz/entry/ncq_performance_analysis

Disabling NCQ looks like a very tiny win for the multi-stream read
case.  I found a much bigger win, but I was doing RAID-0 instead of
RAID-Z.

eric

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] available space?

2008-02-06 Thread Richard Elling
Jure Pečar wrote:
 Maybe a basic zfs question ...

 I have a pool:

 # zpool status backup
   pool: backup
  state: ONLINE
  scrub: none requested
 config:

 NAME  STATE READ WRITE CKSUM
 backupONLINE   0 0 0
   mirror  ONLINE   0 0 0
 c1t0d0s1  ONLINE   0 0 0
 c2t0d0s1  ONLINE   0 0 0
   raidz2  ONLINE   0 0 0
 c1t1d0ONLINE   0 0 0
 c1t2d0ONLINE   0 0 0
 c1t3d0ONLINE   0 0 0
 c1t4d0ONLINE   0 0 0
 c1t5d0ONLINE   0 0 0
 c1t6d0ONLINE   0 0 0
 c1t7d0ONLINE   0 0 0
 c2t1d0ONLINE   0 0 0
 c2t2d0ONLINE   0 0 0
 c2t3d0ONLINE   0 0 0
 c2t4d0ONLINE   0 0 0
 c2t5d0ONLINE   0 0 0
 c2t6d0ONLINE   0 0 0
 c2t7d0ONLINE   0 0 0

 For which zpool list reports:

 # zpool list backup
 NAME      SIZE   USED  AVAIL   CAP  HEALTH  ALTROOT
 backup   13.5T   434K  13.5T    0%  ONLINE  -


 Yet df and zfs list show something else:

 [EMAIL PROTECTED]:~# zfs list
 NAME USED  AVAIL  REFER  MOUNTPOINT
 backup   262K  11.5T  1.78K  none
 backup/files32.0K  11.5T  32.0K  /export/files
 ...

 # df -h
 Filesystem             size   used  avail capacity  Mounted on
 ...
 backup/files            11T    32K    11T     1%    /export/files


 Why does AVAIL differ by such a large amount?
   

They represent two different things. See the man pages for
zpool and zfs for a description of their meanings.
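In short: zpool list reports raw pool capacity, including raidz2 parity,
while zfs list and df report the space actually usable by filesystems.
A rough sketch of the arithmetic for this pool, assuming ~1 TB drives in
the 14-disk raidz2 (the mirror is counted the same way in both views, so
it doesn't contribute to the difference):

  raw capacity (zpool list)           13.5T
  - raidz2 parity (2 disks x ~1T)     ~2.0T
  = usable space (zfs list, df)      ~11.5T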
 -- richard


 (NexentaOS_20080131)

   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] zfs send / receive between different opensolaris versions?

2008-02-06 Thread Michael Hale
Hello everybody,

I'm thinking of building out a second machine as a backup for our mail  
spool where I push out regular filesystem snapshots, something like a  
warm/hot spare situation.

Our mail spool is currently running snv_67 and the new machine would  
probably be running whatever the latest opensolaris version is (snv_77  
or later).

My first question is whether or not zfs send / receive is portable
between differing releases of opensolaris.  My second question (kind
of off topic for this list) is about the difficulty involved in
upgrading snv_67 to a later version of opensolaris, given that we're
running a zfs root boot configuration.
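For what it's worth, the basic flow I have in mind is something like the
following (dataset and host names are just placeholders):

# zfs snapshot mpool/spool@2008-02-06
# zfs send mpool/spool@2008-02-06 | ssh backuphost zfs receive bpool/spool

with later runs using 'zfs send -i' for incrementals.  My (possibly wrong)
understanding is that receiving a stream on a build newer than the sender
generally works, while sending from a newer build to an older one may not.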
--
Michael Hale                          [EMAIL PROTECTED]
Manager of Engineering Support        Enterprise Engineering Group
Transcom Enhanced Services            http://www.transcomus.com





___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-06 Thread Will Murnane
On Feb 6, 2008 6:36 PM, William Fretts-Saxton
[EMAIL PROTECTED] wrote:
 Here are some performance numbers.  Note that, when the
 application server used a ZFS file system to save its data, the
 transaction took TWICE as long.  For some reason, though, iostat is
 showing 5x as much disk writing (to the physical disks) on the ZFS
 partition.  Can anyone see a problem here?
What is the disk layout of the zpool in question?  Striped?  Mirrored?
 Raidz?  I would suggest either a simple stripe or striping+mirroring
as the best-performing layout.
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-06 Thread William Fretts-Saxton
It is a striped/mirror:

 # zpool status
NAMESTATE READ WRITE CKSUM
pool1   ONLINE   0 0 0
  mirrorONLINE   0 0 0
c0t2d0  ONLINE   0 0 0
c0t3d0  ONLINE   0 0 0
  mirrorONLINE   0 0 0
c0t4d0  ONLINE   0 0 0
c0t5d0  ONLINE   0 0 0
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] scrub halts

2008-02-06 Thread Lida Horn
I now have improved sata and marvell88sx driver modules that deal with
various error conditions in a much more solid way.  Changes include
reducing the number of required device resets, properly reporting media
errors (rather than "no additional sense"), and clearing aborted packets
more rapidly, so that progress resumes much more quickly after a hardware
error.  Further, the driver is much quieter (far fewer messages in
/var/adm/messages).

If there is still interest, I can make these binaries available for testing
prior to their availability in Solaris Nevada (OpenSolaris).  The changes
will be checked in soon, but that process always adds a significant delay,
so if anyone would like them sooner, please e-mail me and I will send the
binaries along.

Regards,
Lida
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-06 Thread Vincent Fox
Solaris 10u4 eh?

Sounds a lot like the fsync issues we ran into trying to run Cyrus
mail-server spools on ZFS.

This was highlighted for us by the filebench software varmail test.

OpenSolaris nv78 however worked very well.
 
 
This message posted from opensolaris.org
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-06 Thread Marc Bevand
William Fretts-Saxton william.fretts.saxton at sun.com writes:
 
 I disabled file prefetch and there was no effect.
 
 Here are some performance numbers.  Note that, when the application server
 used a ZFS file system to save its data, the transaction took TWICE as long.
 For some reason, though, iostat is showing 5x as much disk
 writing (to the physical disks) on the ZFS partition.  Can anyone see a
 problem here?

Possible explanation: the Glassfish applications are using synchronous
writes, causing the ZIL (ZFS Intent Log) to be intensively used, which
leads to a lot of extra I/O. Try to disable it:

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29
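For a quick test only (reverting right afterwards), that boils down to
flipping the zil_disable tunable, either with 'set zfs:zil_disable = 1' in
/etc/system or on a live kernel:

# echo zil_disable/W0t1 | mdb -kw
# echo zil_disable/W0t0 | mdb -kw

As far as I know it only takes effect for filesystems mounted after the
change, so a remount is needed to actually see the difference.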

Since disabling it is not recommended, if you find out it is the cause of your
perf problems, you should instead try to use a SLOG (separate intent log, see
above link). Unfortunately your OS version (Solaris 10 8/07) doesn't support
SLOGs, they have only been added to OpenSolaris build snv_68:

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] MySQL, Lustre and ZFS

2008-02-06 Thread kilamanjaro
Hi all.  Any thoughts on if and when ZFS, MySQL, and Lustre 1.8 (and
beyond) will work together and be supported as such by Sun?

- Network Systems Architect
   Advanced Digital Systems Internet 
___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-06 Thread Neil Perrin
Marc Bevand wrote:
 William Fretts-Saxton william.fretts.saxton at sun.com writes:
   
 I disabled file prefetch and there was no effect.

 Here are some performance numbers.  Note that, when the application server
 used a ZFS file system to save its data, the transaction took TWICE as long.
 For some reason, though, iostat is showing 5x as much disk
 writing (to the physical disks) on the ZFS partition.  Can anyone see a
 problem here?
 

 Possible explanation: the Glassfish applications are using synchronous
 writes, causing the ZIL (ZFS Intent Log) to be intensively used, which
 leads to a lot of extra I/O.

The ZIL doesn't do a lot of extra IO.  It usually just does one write per
synchronous request, and will batch up multiple writes into the same log
block if possible.  However, it does need to wait for the writes to be on
stable storage before returning to the application, which is what the
application has requested.  It does this by waiting for the write to
complete and then flushing the disk write cache.  If the write cache is
battery backed for all zpool devices, then the global zfs_nocacheflush
can be set to give dramatically better performance.
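For reference, the usual way to set that (only safe when every device in
every pool has a non-volatile, battery-backed write cache) is an
/etc/system entry plus a reboot:

set zfs:zfs_nocacheflush = 1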
  Try to disable it:

 http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Disabling_the_ZIL_.28Don.27t.29

 Since disabling it is not recommended, if you find out it is the cause of your
 perf problems, you should instead try to use a SLOG (separate intent log, see
 above link). Unfortunately your OS version (Solaris 10 8/07) doesn't support
 SLOGs, they have only been added to OpenSolaris build snv_68:

 http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

 -marc

 ___
 zfs-discuss mailing list
 zfs-discuss@opensolaris.org
 http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
   

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS configuration for a thumper

2008-02-06 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 Your findings for random reads with or without NCQ match my findings:
 http://blogs.sun.com/erickustarz/entry/ncq_performance_analysis
 
 Disabling NCQ looks like a very tiny win for the multi-stream read case.
 I found a much bigger win, but I was doing RAID-0 instead of RAID-Z.

I didn't set out to do the with/without NCQ comparisons.  Rather, my
first runs of filebench and bonnie++ triggered a number of I/O errors
and controller timeout/resets on several different drives, so I disabled
NCQ based on bug 6587133's workaround suggestion.  No more errors
during subsequent testing, so we're running with NCQ disabled until
a patch comes along.

It was useful, however, to see what effect disabling NCQ had.  I find
filebench easier to use than bonnie++, mostly because filebench is
automatically multithreaded, which is necessary to generate a heavy
enough workload to exercise anything more than a few drives (esp.
on machines like T2000's).  The HTML output doesn't hurt, either.

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-06 Thread Marion Hakanson
[EMAIL PROTECTED] said:
 Here are some performance numbers.  Note that, when the application server
 used a ZFS file system to save its data, the transaction took TWICE as long.
 For some reason, though, iostat is showing 5x as much disk writing (to the
 physical disks) on the ZFS partition.  Can anyone see a problem here? 

I'm not familiar with the application in use here, but your iostat numbers
remind me of something I saw during small overwrite tests on ZFS.  Even
though the test was doing only writing, because it was writing over only a
small part of existing blocks, ZFS had to read (the unchanged part of) each
old block in before writing out the changed block to a new location (COW).

This is a case where you want to set the ZFS recordsize to match your
application's typical write size, in order to avoid the read overhead
inherent in partial-block updates.  UFS by default has a smaller max
blocksize than ZFS' default 128k, so in addition to the ZIL/fsync issue
UFS will also suffer less overhead from such partial-block updates.

Again, this may not be what's going on, but it's worth checking if you
haven't already done so.
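A minimal sketch of that check, assuming the application really does do
small (say 8 KB) writes, using the pool name from earlier in the thread:

# zfs get recordsize pool1
# zfs set recordsize=8k pool1

Note that recordsize only applies to files created after the change;
existing files keep the block size they were written with.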

Regards,

Marion


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Performance Issue

2008-02-06 Thread Marc Bevand
Neil Perrin Neil.Perrin at Sun.COM writes:
 
 The ZIL doesn't do a lot of extra IO. It usually just does one write per 
 synchronous request and will batch up multiple writes into the same log
 block if possible.

OK, I was wrong then.  Well, William, I think Marion Hakanson has the
most plausible explanation.  As he suggests, experiment with 'zfs set
recordsize=XXX' to force the filesystem to use small records.  See the
zfs(1M) manpage.

-marc

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS taking up to 80 seconds to flush a single 8KB O_SYNC block.

2008-02-06 Thread Nathan Kroenert
Hey all -

I'm working on an interesting issue where I'm seeing ZFS being quite
cranky about O_SYNC writes.

Bottom line is that I have a small test case that does essentially this:

open file for writing  -- O_SYNC
loop {
    write() 8KB of random data
    print time taken to write data
}

It's taking anywhere up to 80 seconds per 8KB block.  When the 'problem'
is not in evidence (and it's not always happening), I can do around
1200 O_SYNC writes per second...

It seems to be waiting here virtually all of the time:

> 0t11021::pid2proc | ::print proc_t p_tlist | ::findstack -v
stack pointer for thread 30171352960: 2a118052df1
[ 02a118052df1 cv_wait+0x38() ]
   02a118052ea1 zil_commit+0x44(1, 6b50516, 193, 60005db66bc, 6b50570,
   60005db6640)
   02a118052f51 zfs_write+0x554(0, 14000, 2a1180539e8, 6000af22840, 
2000,
   2a1180539d8)
   02a118053071 fop_write+0x20(304898cd100, 2a1180539d8, 10, 
300a27a9e48, 0,
   7b7462d0)
   02a118053121 write+0x268(4, 8058, 60051a3d738, 2000, 113, 1)
   02a118053221 dtrace_systrace_syscall32+0xac(4, ffbfdaf0, 2000, 21e80,
   ff3a00c0, ff3a0100)
   02a1180532e1 syscall_trap32+0xcc(4, ffbfdaf0, 2000, 21e80, ff3a00c0,
   ff3a0100)

And this is also evident in a dtrace of it, following the write in...

...
  28- zil_commit
  28  - cv_wait
  28- thread_lock
  28- thread_lock
  28- cv_block
  28  - ts_sleep
  28  - ts_sleep
  28  - new_mstate
  28- cpu_update_pct
  28  - cpu_grow
  28- cpu_decay
  28  - exp_x
  28  - exp_x
  28- cpu_decay
  28  - cpu_grow
  28- cpu_update_pct
  28  - new_mstate
  28  - disp_lock_enter_high
  28  - disp_lock_enter_high
  28  - disp_lock_exit_high
  28  - disp_lock_exit_high
  28- cv_block
  28- sleepq_insert
  28- sleepq_insert
  28- disp_lock_exit_nopreempt
  28- disp_lock_exit_nopreempt
  28- swtch
  28  - disp
  28- disp_lock_enter
  28- disp_lock_enter
  28- disp_lock_exit
  28- disp_lock_exit
  28- disp_getwork
  28- disp_getwork
  28- restore_mstate
  28- restore_mstate
  28  - disp
  28  - pg_cmt_load
  28  - pg_cmt_load
  28- swtch
  28- resume
  28  - savectx
  28- schedctl_save
  28- schedctl_save
  28  - savectx
...

At this point, it waits for up to 80 seconds.

I'm also seeing zil_commit() being called around 7-15 times per second.

For kicks, I disabled the ZIL: zil_disable/W0t1, and that made not a 
pinch of difference. :)

For what it's worth, this is a T2000, running Oracle, connected to an 
HDS 9990 (using 2GB fibre), with 8KB record sizes for the oracle 
filesystems, and I'm only seeing the issue on the ZFS filesystems that 
have the active oracle tables on them.

The O_SYNC test case is just trying to help me understand what's 
happening. The *real* problem is that oracle is running like rubbish 
when it's trying to roll forward archive logs from another server. It's 
an almost 100% write workload. At the moment, it cannot even keep up 
with the other server's log creation rate, and it's barely doing 
anything. (The other box is quite different, so not really valid for 
direct comparison at this point).

6513020 looked interesting for a while, but I already have 120011-14 and
127111-03 installed.

I'm looking into the cache flush settings of the 9990 array to see if 
it's that killing me, but I'm also looking for any other ideas on what 
might be hurting me.

I also have set
zfs:zfs_nocacheflush = 1
in /etc/system

The Oracle Logs are on a separate Zpool and I'm not seeing the issue on 
those filesystems.

The lockstats I have run are not yet all that interesting. If anyone has 
ideas on specific incantations I should use or some specific D or 
anything else, I'd be most appreciative.
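The sort of D I have in mind is along these lines (a rough, untested
sketch that just times zil_commit itself):

# dtrace -n '
fbt::zil_commit:entry { self->ts = timestamp; }
fbt::zil_commit:return /self->ts/ {
@["zil_commit latency (ns)"] = quantize(timestamp - self->ts);
self->ts = 0;
}'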

Cheers!

Nathan.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss