Re: [zfs-discuss] Poor read/write performance when using ZFS iSCSI target

2008-08-18 Thread Roch - PAE

initiator_host:~ # dd if=/dev/zero bs=1k of=/dev/dsk/c5t0d0 
count=100

So this is going at roughly 3000 x 1K writes per second, or
about 330 usec per write. The iSCSI target is probably doing an
over-the-wire operation for each request, so it looks fine
at first glance. 
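
If the limit really is the per-request round trip rather than the wire
bandwidth, a larger block size should raise the data rate roughly in
proportion. A quick check, sketched here with the same device as above
and arbitrary counts:

    initiator_host:~ # dd if=/dev/zero bs=8k of=/dev/dsk/c5t0d0 count=10000
    initiator_host:~ # dd if=/dev/zero bs=128k of=/dev/dsk/c5t0d0 count=1000

If MB/s scales with the block size, the target itself is fine and the
tiny request size is what makes the original test look slow.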

-r


Cody Campbell writes:
  Greetings,
  
  I want to take advantage of the iSCSI target support in the latest
  release (svn_91) of OpenSolaris, and I'm running into some performance
  problems when reading/writing from/to my target. I'm including as much
  detail as I can so bear with me here... 
  
  I've built an x86 OpenSolaris server (Intel Xeon running NV_91) with a
  zpool of 15 750GB SATA disks, of which I've created and exported a ZFS
  Volume with the shareiscsi=on property set to generate an iSCSI
  target.   
  
  My problem is, when I connect to this target from any initiator
  (tested with both Linux 2.6 and OpenSolaris NV_91 SPARC and x86), the
  read/write speed is dreadful (~ 3 megabytes / second!).  When I test
  read/write performance locally with the backing pool, I have excellent
  speeds. The same can be said when I use services such as NFS and FTP
  to move files between other hosts on the network and the volume I am
  exporting as a Target. When doing this I have achieved the
  near-Gigabit speeds I expect, which has me thinking this isn't some
  sort of network problem (I've already disabled the Nagle
  algorithm, if you're wondering). It's not until I add the iSCSI target
  to the stack that the speeds go south, so I am concerned that I may be
  missing something in configuration of the target.  
  
  Below are some details pertaining to my configuration. 
  
  OpenSolaris iSCSI Target Host:
  
  target_host:~ # zpool status pool0
pool: pool0
   state: ONLINE
   scrub: none requested
  config:
  
          NAME        STATE     READ WRITE CKSUM
          pool0       ONLINE       0     0     0
            raidz1    ONLINE       0     0     0
              c0t0d0  ONLINE       0     0     0
              c0t1d0  ONLINE       0     0     0
              c0t2d0  ONLINE       0     0     0
              c0t3d0  ONLINE       0     0     0
              c0t4d0  ONLINE       0     0     0
              c0t5d0  ONLINE       0     0     0
              c0t6d0  ONLINE       0     0     0
            raidz1    ONLINE       0     0     0
              c0t7d0  ONLINE       0     0     0
              c1t0d0  ONLINE       0     0     0
              c1t1d0  ONLINE       0     0     0
              c1t2d0  ONLINE       0     0     0
              c1t3d0  ONLINE       0     0     0
              c1t4d0  ONLINE       0     0     0
              c1t5d0  ONLINE       0     0     0
          spares
            c1t6d0    AVAIL
  
  errors: No known data errors
  
  target_host:~ # zfs get all pool0/vol0
  NAME        PROPERTY        VALUE                  SOURCE
  pool0/vol0  type            volume                 -
  pool0/vol0  creation        Wed Jul  2 18:16 2008  -
  pool0/vol0  used            5T                     -
  pool0/vol0  available       7.92T                  -
  pool0/vol0  referenced      34.2G                  -
  pool0/vol0  compressratio   1.00x                  -
  pool0/vol0  reservation     none                   default
  pool0/vol0  volsize         5T                     -
  pool0/vol0  volblocksize    8K                     -
  pool0/vol0  checksum        on                     default
  pool0/vol0  compression     off                    default
  pool0/vol0  readonly        off                    default
  pool0/vol0  shareiscsi      on                     local
  pool0/vol0  copies          1                      default
  pool0/vol0  refreservation  5T                     local
  
  
  target_host:~ # iscsitadm list target -v pool0/vol0
  
  Target: pool0/vol0
  iSCSI Name: iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
  Alias: pool0/vol0
  Connections: 1
  Initiator:
  iSCSI Name: iqn.1986-03.com.sun:01:0003ba681e7f.486c0829
  Alias: unknown
  ACL list:
  TPGT list:
  TPGT: 1
  LUN information:
  LUN: 0
  GUID: 01304865b1b42a00486c29d2
  VID: SUN
  PID: SOLARIS
  Type: disk
  Size: 5.0T
  Backing store: /dev/zvol/rdsk/pool0/vol0
  Status: online
  
  
  OpenSolaris iSCSI Initiator Host:
  
  
  initiator_host:~ # iscsiadm list target -vS 
  iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
  Target: iqn.1986-03.com.sun:02:fb1c7071-8f35-eb03-9efb-b950d5bdd1ab
  Alias: pool0/vol0
  TPGT: 1
  ISID: 402a
  Connections: 1
  CID: 0
IP address (Local): 192.168.4.2:63960
IP address (Peer): 192.168.4.3:3260
Discovery Method: SendTargets 
Login Parameters (Negotiated):
 

Re: [zfs-discuss] Why RAID 5 stops working in 2009

2008-08-18 Thread Roch - PAE
Kyle McDonald writes:
  Ross wrote:
   Just re-read that and it's badly phrased.  What I meant to say is that a 
   raid-z / raid-5 array based on 500GB drives seems to have around a 1 in 10 
   chance of losing some data during a full rebuild.


 
  Actually, I think it's been explained already why this is one 
  area where RAID-Z will really start to show some of the ways it's 
  different from its RAID-5 ancestors. For one, a RAID-5 controller has 
  no idea of the filesystem, and therefore has to rebuild every bit on the 
  disk, whether it's used or not, and if it can't it will declare the 
  whole array unusable. RAID-Z on the other hand, since it is integrated 
  with the filesystem, only needs to rebuild the *used* data, and won't 
  care if unused parts of the disks can't be rebuilt.
  
  Second, a factor that the author of that article leaves out is that 
  decent RAID-5, and RAID-Z can do 'scrubs' of the data at regular 
  intervals, and this will many times catch and deal with these read 
  problems well before they have a chance to take all your data with them. 
  The types of errors the author writes about many times are caused by how 
  accurately the block was written and not a defect of the media, so many 
  times they can be fixed by just rewriting the data to the same block. On 
  ZFS this will almost never happen, because of COW it will always choose 
  a new block to write to. I don't think many (if any) RAID-5 
  implementations can change the location of data on a drive.
  

Moreover, ZFS stores redundant copies of metadata, so even if
a full raid-z stripe goes south, we can still rebuild most of
the pool data. It seems that at worst, such double failures would lead
to a handful of un-recovered files.

-r


   -Kyle
  
   This message posted from opensolaris.org
   ___
   zfs-discuss mailing list
   zfs-discuss@opensolaris.org
   http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
 
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs_nocacheflush

2008-07-31 Thread Roch - PAE
Peter Tribble writes:
  A question regarding zfs_nocacheflush:
  
  The Evil Tuning Guide says to only enable this if every device is
  protected by NVRAM.
  
  However, is it safe to enable zfs_nocacheflush when I also have
  local drives (the internal system drives) using ZFS, in particular if
  the write cache is disabled on those drives?
  
  What I have is a local zfs pool from the free space on the internal
  drives, so I'm only using a partition and the drive's write cache
  should be off, so my theory here is that zfs_nocacheflush shouldn't
  have any effect because there's no drive cache in use...
  

Seems plausible, but I'd check that the caches are indeed
off using format -e.
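
For reference, a rough sketch of checking that with format -e (exact
menu entries may vary by release and driver, so treat this as an
approximation):

    # format -e
    ... select the internal disk ...
    format> cache
    cache> write_cache
    write_cache> display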

-r

  -- 
  -Peter Tribble
  http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Periodic flush

2008-07-01 Thread Roch - PAE
Robert Milkowski writes:

  Hello Roch,
  
  Saturday, June 28, 2008, 11:25:17 AM, you wrote:
  
  
  RB I suspect,  a single dd is cpu bound.
  
  I don't think so.
  

We're nearly so, as your numbers show. More below.

  Se below one with a stripe of 48x disks again. Single dd with 1024k
  block size and 64GB to write.
  
  bash-3.2# zpool iostat 1
         capacity     operations    bandwidth
  pool        used  avail   read  write   read  write
  ----------  -----  -----  -----  -----  -----  -----
  test 333K  21.7T  1  1   147K   147K
  test 333K  21.7T  0  0  0  0
  test 333K  21.7T  0  0  0  0
  test 333K  21.7T  0  0  0  0
  test 333K  21.7T  0  0  0  0
  test 333K  21.7T  0  0  0  0
  test 333K  21.7T  0  0  0  0
  test 333K  21.7T  0  0  0  0
  test 333K  21.7T  0  1.60K  0   204M
  test 333K  21.7T  0  20.5K  0  2.55G
  test4.00G  21.7T  0  9.19K  0  1.13G
  test4.00G  21.7T  0  0  0  0
  test4.00G  21.7T  0  1.78K  0   228M
  test4.00G  21.7T  0  12.5K  0  1.55G
  test7.99G  21.7T  0  16.2K  0  2.01G
  test7.99G  21.7T  0  0  0  0
  test7.99G  21.7T  0  13.4K  0  1.68G
  test12.0G  21.7T  0  4.31K  0   530M
  test12.0G  21.7T  0  0  0  0
  test12.0G  21.7T  0  6.91K  0   882M
  test12.0G  21.7T  0  21.8K  0  2.72G
  test16.0G  21.7T  0839  0  88.4M
  test16.0G  21.7T  0  0  0  0
  test16.0G  21.7T  0  4.42K  0   565M
  test16.0G  21.7T  0  18.5K  0  2.31G
  test20.0G  21.7T  0  8.87K  0  1.10G
  test20.0G  21.7T  0  0  0  0
  test20.0G  21.7T  0  12.2K  0  1.52G
  test24.0G  21.7T  0  9.28K  0  1.14G
  test24.0G  21.7T  0  0  0  0
  test24.0G  21.7T  0  0  0  0
  test24.0G  21.7T  0  0  0  0
  test24.0G  21.7T  0  14.5K  0  1.81G
  test28.0G  21.7T  0  10.1K  63.6K  1.25G
  test28.0G  21.7T  0  0  0  0
  test28.0G  21.7T  0  10.7K  0  1.34G
  test32.0G  21.7T  0  13.6K  63.2K  1.69G
  test32.0G  21.7T  0  0  0  0
  test32.0G  21.7T  0  0  0  0
  test32.0G  21.7T  0  11.1K  0  1.39G
  test36.0G  21.7T  0  19.9K  0  2.48G
  test36.0G  21.7T  0  0  0  0
  test36.0G  21.7T  0  0  0  0
  test36.0G  21.7T  0  17.7K  0  2.21G
  test40.0G  21.7T  0  5.42K  63.1K   680M
  test40.0G  21.7T  0  0  0  0
  test40.0G  21.7T  0  6.62K  0   844M
  test44.0G  21.7T  1  19.8K   125K  2.46G
  test44.0G  21.7T  0  0  0  0
  test44.0G  21.7T  0  0  0  0
  test44.0G  21.7T  0  18.0K  0  2.24G
  test47.9G  21.7T  1  13.2K   127K  1.63G
  test47.9G  21.7T  0  0  0  0
  test47.9G  21.7T  0  0  0  0
  test47.9G  21.7T  0  15.6K  0  1.94G
  test47.9G  21.7T  1  16.1K   126K  1.99G
  test51.9G  21.7T  0  0  0  0
  test51.9G  21.7T  0  0  0  0
  test51.9G  21.7T  0  14.2K  0  1.77G
  test55.9G  21.7T  0  14.0K  63.2K  1.73G
  test55.9G  21.7T  0  0  0  0
  test55.9G  21.7T  0  0  0  0
  test55.9G  21.7T  0  16.3K  0  2.04G
  test59.9G  21.7T  0  14.5K  63.2K  1.80G
  test59.9G  21.7T  0  0  0  0
  test59.9G  21.7T  0  0  0  0
  test59.9G  21.7T  0  17.7K  0  2.21G
  test63.9G  21.7T  0  4.84K  62.6K   603M
  test63.9G  21.7T  0  0  0  0
  test63.9G  21.7T  0  0  0  0
  test63.9G  21.7T  0  0  0  0
  test63.9G  21.7T  0  0  0  0
  test63.9G  21.7T  0  0  0  0
  test63.9G  21.7T  0  0  0  0
  test63.9G  21.7T  0  0  0  0
  ^C
  bash-3.2#
  
  bash-3.2# ptime dd if=/dev/zero of=/test/q1 bs=1024k count=65536
  65536+0 records in
  65536+0 records out
  
  real 1:06.312
  user0.074
  sys54.060
  bash-3.2#
  
  Doesn't look like it's CPU bound.
  

So with 54s of sys time over 66s of elapsed time, we're at about 81% of
single-CPU saturation. If you make this 100% you will still have zeros
in the zpool iostat.

We 

Re: [zfs-discuss] Periodic flush

2008-05-15 Thread Roch - PAE
Bob Friesenhahn writes:
  On Tue, 15 Apr 2008, Mark Maybee wrote:
   going to take 12sec to get this data onto the disk.  This impedance
   mis-match is going to manifest as pauses:  the application fills
   the pipe, then waits for the pipe to empty, then starts writing again.
   Note that this won't be smooth, since we need to complete an entire
   sync phase before allowing things to progress.  So you can end up
   with IO gaps.  This is probably what the original submitter is
  
  Yes.  With an application which also needs to make best use of 
  available CPU, these I/O gaps cut into available CPU time (by 
  blocking the process) unless the application uses multithreading and 
  an intermediate write queue (more memory) to separate the CPU-centric 
  parts from the I/O-centric parts.  While the single-threaded 
  application is waiting for data to be written, it is not able to read 
  and process more data.  Since reads take time to complete, being 
  blocked on write stops new reads from being started so the data is 
  ready when it is needed.
  
   There is one down side to this new model: if a write load is very
   bursty, e.g., a large 5GB write followed by 30secs of idle, the
   new code may be less efficient than the old.  In the old code, all
  
  This is also a common scenario. :-)
  
  Presumably the special slow I/O code would not kick in unless the 
  burst was large enough to fill quite a bit of the ARC.
  

Bursts of 1/8th of physical memory or 5 seconds of storage
throughput, whichever is smaller.

-r



  Real time throttling is quite a challenge to do in software.
  
  Bob
  ==
  Bob Friesenhahn
  [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
  GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Performance with Sun StorageTek 2540

2008-02-18 Thread Roch - PAE

Bob Friesenhahn writes:

  On Fri, 15 Feb 2008, Roch Bourbonnais wrote:
   What was the interlace on the LUN ?
  
   The question was about LUN  interlace not interface.
   128K to 1M works better.
  
  The segment size is set to 128K.  The max the 2540 allows is 512K. 
  Unfortunately, the StorageTek 2540 and CAM documentation does not 
  really define what segment size means.
  
   Any compression ?
  
  Compression is disabled.
  
   Does turn off checksum helps the number (that would point to a CPU limited 
   throughput).
  
  I have not tried that but this system is loafing during the benchmark. 
  It has four 3GHz Opteron cores.
  
  Does this output from 'iostat -xnz 20' help to understand issues?
  
                       extended device statistics
     r/s    w/s   kr/s     kw/s  wait  actv wsvc_t asvc_t  %w  %b device
     3.0    0.7   26.4      3.5   0.0   0.0    0.0    4.2   0   2 c1t1d0
     0.0  154.2    0.0  19680.3   0.0  20.7    0.0  134.2   0  59 c4t600A0B80003A8A0B096147B451BEd0
     0.0  211.5    0.0  26940.5   1.1  33.9    5.0  160.5  99 100 c4t600A0B800039C9B50A9C47B4522Dd0
     0.0  211.5    0.0  26940.6   1.1  33.9    5.0  160.4  99 100 c4t600A0B800039C9B50AA047B4529Bd0
     0.0  154.0    0.0  19654.7   0.0  20.7    0.0  134.2   0  59 c4t600A0B80003A8A0B096647B453CEd0
     0.0  211.3    0.0  26915.0   1.1  33.9    5.0  160.5  99 100 c4t600A0B800039C9B50AA447B4544Fd0
     0.0  152.4    0.0  19447.0   0.0  20.5    0.0  134.5   0  59 c4t600A0B80003A8A0B096A47B4559Ed0
     0.0  213.2    0.0  27183.8   0.9  34.1    4.2  159.9  90 100 c4t600A0B800039C9B50AA847B45605d0
     0.0  152.5    0.0  19453.4   0.0  20.5    0.0  134.5   0  59 c4t600A0B80003A8A0B096E47B456DAd0
     0.0  213.2    0.0  27177.4   0.9  34.1    4.2  159.9  90 100 c4t600A0B800039C9B50AAC47B45739d0
     0.0  213.2    0.0  27195.3   0.9  34.1    4.2  159.9  90 100 c4t600A0B800039C9B50AB047B457ADd0
     0.0  154.4    0.0  19711.8   0.0  20.7    0.0  134.0   0  59 c4t600A0B80003A8A0B097347B457D4d0
     0.0  211.3    0.0  26958.6   1.1  33.9    5.0  160.6  99 100 c4t600A0B800039C9B50AB447B4595Fd0
  

Interesting that a subset of 5 disks is responding faster
(which also leads to smaller actv queues and so lower
service times) than the other 7.



and the slow ones are subject to more writes...haha.

If the sizes of the LUNs are different (or they have different
amounts of free blocks) then maybe ZFS is now trying to rebalance
free space by targeting a subset of the disks with more
new data. Pool throughput will be impacted by this.


-r





  Bob
  ==
  Bob Friesenhahn
  [EMAIL PROTECTED], http://www.simplesystems.org/users/bfriesen/
  GraphicsMagick Maintainer,http://www.GraphicsMagick.org/
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] UFS on zvol Cache Questions...

2008-02-11 Thread Roch - PAE

Priming the cache for ZFS should work, at least right after boot
when freemem is large; any read block will make it into the
cache. Post boot, when memory is already primed with something else
(what?), it gets more difficult for both UFS and ZFS to
guess what to keep in their caches.

Did you try priming ZFS after boot ?

So next, you seem to suffer because your sequential writes to
log files appear to displace the more useful DB files from the
ARC (I'd be interested to see if this still occurs after
you've primed the ZFS cache after boot).

Note that if your logfile write rate is huge (dd like) then ZFS
cache management will suffer, but that is well on its way to
being fixed. For a directory server, though, I would think that the
log rate would be more reasonable and that your storage is able to
keep up. That gives ZFS cache management a fighting chance
to keep the reused data in preference to the sequential writes.

If the default behavior is not working for you, we'll need
to consider the ARC behavior in this case. I don't see why
it should not work out of the box. But manual control will
also come in the form of this DIO-like feature :

6429855  Need way to tell ZFS that caching is a lost cause

Until we manage to solve your problem out of the box,
you might also run a background process that keeps priming
the cache at a low I/O rate. Not a great workaround, but it
should be effective.
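
A minimal sketch of such a primer, assuming the layout described below
(the /db path, the interval and the block size are all arbitrary); it
simply re-reads the db3 files in a loop, much like the UFS priming loop
quoted further down:

    #!/bin/sh
    # keep the directory data warm in the ARC at a low I/O rate
    while true
    do
        for f in `find /db -name '*.db3'`
        do
            dd if=$f of=/dev/null bs=128k >/dev/null 2>&1
            sleep 10    # pace the reads so they stay in the background
        done
    done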


-r



Brad Diggs writes:
  Hello Darren,
  
  Please find responses in line below...
  
  On Fri, 2008-02-08 at 10:52 +, Darren J Moffat wrote:
   Brad Diggs wrote:
I would like to use ZFS but with ZFS I cannot prime the cache
and I don't have the ability to control what is in the cache 
(e.g. like with the directio UFS option).
   
   Why do you believe you need that at all ?  
  
  My application is directory server.  The #1 resource that 
  directory needs to make maximum utilization of is RAM.  In 
  order to do that, I want to control every aspect of RAM
  utilization both to safely use as much RAM as possible AND
  avoid contention among things trying to use RAM.
  
  Lets consider the following example.  A customer has a 
  50M entry directory.  The sum of the data (db3 files) is
  approximately 60GB.  However, there is another 2GB for the
  root filesystem, 30GB for the changelog, 1GB for the 
  transaction logs, and 10GB for the informational logs.
  
  The system on which directory server will run has only 
  64GB of RAM.  The system is configured with the following
  partitions:
  
    FS      Used(GB)  Description
    /       2         root
    /db     60        directory data
    /logs   41        changelog, txn logs, and info logs
    swap    10        system swap
  
  I prefer to keep the directory db cache and entry caches
  relatively small.  So the db cache is 2GB and the entry 
  cache is 100M.  This leaves roughly 63GB of RAM for my 60GB
  of directory data and Solaris. The only way to ensure that
  the directory data (/db) is the only thing in the filesystem
  cache is to set directio on / (root) and (/logs).
  
   What do you do to prime the cache with UFS 
  
  cd ds_instance_dir/db
  for i in `find . -name '*.db3'`
  do
dd if=${i} of=/dev/null
  done
  
   and what benefit do you think it is giving you ?
  
  Priming the directory server data into filesystem cache 
  reduces ldap response time for directory data in the
  filesystem cache.  This could mean the difference between
  a sub ms response time and a response time on the order of
  tens or hundreds of ms depending on the underlying storage
  speed.  For telcos in particular, minimal response time is 
  paramount.
  
  Another common scenario is when we do benchmark bakeoffs
  with another vendor's product.  If the data isn't pre-
  primed, then ldap response time and throughput will be
  artificially degraded until the data is primed into either
  the filesystem or directory (db or entry) cache.  Priming
  via ldap operations can take many hours or even days 
  depending on the number of entries in the directory server.
  However, priming the same data via dd takes minutes to hours
  depending on the size of the files.  
  
  As you know in benchmarking scenarios, time is the most limited
  resource that we typically have.  Thus, priming via dd is much
  preferred.
  
  Lastly, in order to achieve optimal use of available RAM, we
  use directio for the root (/) and other non-data filesystems.
  This makes certain that the only data in the filesystem cache
  is the directory data.
  
   Have you tried just using ZFS and found it doesn't perform as you need 
   or are you assuming it won't because it doesn't have directio ?
  
  We have done extensive testing with ZFS and love it.  The three 
  areas lacking for our use cases are as follows:
   * No ability to control what is in cache. e.g. no directio
   * No absolute ability to apply an upper boundary to the amount
 of RAM consumed by ZFS.  I know that the arc cache has a 
 control that 

Re: [zfs-discuss] simulating directio on zfs?

2008-02-04 Thread Roch - PAE

Andrew Robb writes:
  The big problem that I have with non-directio is that buffering delays
  program execution. When reading/writing files that are many times
  larger than RAM without directio, it is very apparent that system
  response drops through the floor- it can take several minutes for an
  ssh login to prompt for a password. This is true both for UFS and
  ZFS. 

Apart from the ZFS write case, I find this a bit surprising. Are
you sure this is a general statement, or would details of your
configuration be of interest here? As for the ZFS write case,
this problem is well on its way to being fixed.

6429205 each zpool needs to monitor it's throughput and throttle heavy 
writers

Now that you say this, I think I can see how it would be
possible in the UFS write case also. Both read cases I find
troublesome though. See below on disabling speculative reads.

For ZFS, I guess some relief might come from implementing something
like this :

6429855 Need way to  tell ZFS that caching is a lost cause


  

  Repeat the exercise with directio on UFS and there is no discernible
  delay in starting new applications (ssh login takes a second or so to
  prompt for a password). Writing a large file might appear to take a
  few seconds longer with directio, but add a sync command to the end
  and it is apparent that there is no real delay in getting the data to
  disc with directio. 
  
  I'd like to see directio() provide some of the facilities under ZFS that it 
  affords under UFS:
  1. data is not buffered beyond the I/O request
  2. no speculative reads
  3. synchronous writes of whole records
  4. concurrent I/O (which is already available in ZFS)
  

So I think we should not confuse UFS directio with
synchronous semantics; point 3 comes from that
confusion.

For point 2, we've not too long ago fixed one level of
speculative reads (vdev), which should not cause problems
anymore. The other level (zfetch) needs attention. I see no
reason that good software can't work out of the box for you.
In the meantime it is possible to disable speculative reads
as described here :


http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#File-Level_Prefetching

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Device-Level_Prefetching
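
For the record, the tunables behind those two links were set in
/etc/system at the time roughly as follows (names can change between
builds, so this is only a sketch; check the guide itself):

    * disable file-level (zfetch) prefetching
    set zfs:zfs_prefetch_disable = 1
    * disable device-level (vdev cache) prefetching
    set zfs:zfs_vdev_cache_size = 0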


For point 1, I think the current behavior of ZFS is far from 
good.  Once the problems are fixed though it will be time to 
reevaluate if the buffering strategies are causing problems and
under what conditions. Your memory BW issues could be one of 
them.

  Note: I consider memory bandwidth a finite and precious resource - not
  to be wasted in double-buffering. (I have a naive test program that is
  completely bound by main memory bandwidth - two programs on two CPUs
  run at half the speed of a single program on one CPU.) 
   

But also consider your number of disks and system bus bandwidth in
this equation. Many workloads won't be hit by the double
memory BW if they are spindle limited. A lot of the longing
for directio comes from a few serious quirks in the current
implementation. You've had legitimate issues in UFS and ZFS,
and the UFS issues happened to be fixed by UFS/DIO; I think many
of them can be fixed in ZFS with something not called directio.


-r





   
  This message posted from opensolaris.org
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL controls in Solaris 10 U4?

2008-01-30 Thread Roch - PAE

Jonathan Loran writes:
  
  Is it true that Solaris 10 u4 does not have any of the nice ZIL controls 
  that exist in the various recent Open Solaris flavors?  I would like to 
  move my ZIL to solid state storage, but I fear I can't do it until I 
  have another update.  Heck, I would be happy to just be able to turn the 
  ZIL off to see how my NFS on ZFS performance is effected before spending 
  the $'s.  Anyone know when will we see this in Solaris 10?
  

You can certainly turn it off with any release (Jim's link).
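
For completeness, the usual way to do that on those releases was an
/etc/system setting, sketched here (for testing only, given the
caveats below):

    * disable the ZIL -- for testing only
    set zfs:zil_disable = 1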

It's true that S10u4 does not have the Separate Intent Log 
to allow using an SSD for ZIL blocks. I believe S10U5 will
have that feature.

As noted, disabling the ZIL won't lead to ZFS pool
corruption, just DB corruption (and that includes NFS
clients). To protect against that, in the event of a server
crash with zil_disable=1, you'd need to reboot all NFS
clients of the server (to clear the clients' caches), and better
do this before the server comes back up (kind of a raw
proposition here).

-r


  Thanks,
  
  Jon
  
  -- 
  
  
  - _/ _/  /   - Jonathan Loran -   -
  -/  /   /IT Manager   -
  -  _  /   _  / / Space Sciences Laboratory, UC Berkeley
  -/  / /  (510) 643-5146 [EMAIL PROTECTED]
  - __/__/__/   AST:7731^29u18e3
   
  
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS vq_max_pending value ?

2008-01-23 Thread Roch - PAE

Manoj Nayak writes:
  Hi All.
  
  The ZFS documentation says ZFS schedules its I/O in such a way that it
  manages to saturate a single disk's bandwidth using enough concurrent
  128K I/Os. The number of concurrent I/Os is decided by vq_max_pending.
  The default value for vq_max_pending is 35.
  
  We have created a 4-disk raid-z group inside a ZFS pool on a Thumper.
  The ZFS record size is set to 128k. When we read/write a 128K record,
  it issues a 128K/3 I/O to each of the 3 data disks in the 4-disk raid-z
  group.
  
  We need to saturate all three data disks' bandwidth in the raid-z group.
  Is it required to set the vq_max_pending value to 35*3=135 ?
  

Nope.

Once a disk controller is working on 35 requests, we don't
expect to get any more out of it by queueing more requests
and we might even confuse the firmware and get less.

Now for an array controller and a vdev fronting a large
number of disks, 35 might be too low a number to allow
full throughput. Rather than tuning 35 up, we suggest
splitting devices into smaller LUNs, since each LUN is given
a 35-deep queue.

Tuning vq_max_pending down helps read and synchronous write
(ZIL) latency. Today the preferred way to help ZIL latency
is to use a Separate Intent Log.
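
If you still want to experiment with the queue depth, the usual knob of
that era was the global zfs_vdev_max_pending (the name and its
availability vary by build, so take this as a sketch rather than a
recipe):

    # lower the per-vdev queue depth to 10 on a live system
    echo zfs_vdev_max_pending/W0t10 | mdb -kw

    * or persistently, in /etc/system
    set zfs:zfs_vdev_max_pending = 10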

-r


  Thanks
  Manoj Nayak
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS vdev_cache

2008-01-22 Thread Roch - PAE

Manoj Nayak writes:
  Hi All,
  
  Is any dtrace script available to figure out the vdev_cache (or
  software track buffer) reads in kilobytes?
  
  The document says the default size of the read is 128k; however, the
  vdev_cache source code implementation says the default size is 64k.
  
  Thanks
  Manoj Nayak
  

Which document ? It's 64K when it applies.
Nevada won't use the vdev_cache for data blocks anymore.
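
As for a script: a rough dtrace sketch, assuming the vdev_cache_read()
entry point and the io_size field of zio_t from that era's source (it
sums the read requests offered to the vdev cache, in kilobytes):

    # dtrace -n 'fbt::vdev_cache_read:entry { @["KB"] = sum(args[0]->io_size / 1024); }'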

-r

  
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS recordsize

2008-01-18 Thread Roch - PAE

Why do you want greater than 128K records?

Do check out:
http://blogs.sun.com/roch/entry/128k_suffice

-r


Manoj Nayak writes:
  Hi All,
  
  Is it not possible to increase the ZFS record size beyond 128k? I am using 
  Solaris 10 Update 4.
  
  I get the following error when I try to set the ZFS record size to 1024k:
  zfs set recordsize=1024k md9/test
  cannot set property for 'md9/test': 'recordsize' must be power of 2 from 
  512 to 128k
  
  Thanks
  Manoj Nayak
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS recordsize

2008-01-18 Thread Roch - PAE


Manoj Nayak writes:
  Roch - PAE wrote:
   Why do you want greater than 128K records.
 
  A single-parity RAID-Z pool is created on a Thumper; it consists of four
  disks. Solaris 10 Update 4 runs on the Thumper. A ZFS filesystem is created
  in the pool, and 1 MB of data is written to a file in the filesystem using
  the write(2) system call. However, dtrace displays too many small-sized
  physical disk reads when the same 1 MB of data is read using the read()
  system call. recordsize is set to 128k.
  
  What's going on here ? why so many small size block read is done ?
  

Each record becomes its own raid-z stripe.

When you read a 128K record, ZFS will need to issue a 128K/3 I/O
to each of the 3 data disks in the 4-disk raid-z group. But
with prefetching it's possible that the I/O scheduler
aggregates this to a higher value. It seems this is not
happening on your setup. How large are the I/Os? If smaller
than the above, is the pool rather filled up?

More reading :
http://blogs.sun.com/roch/entry/when_to_and_not_to (raidz)

-r



  Does some option need to be specified at the time of creating this file
  to tell ZFS to use a bigger block size?
  
  I think ZFS uses block sizes as per the following statements.
  
  ZFS files smaller than the recordsize are stored using a single
  filesystem block (FSB) of variable length, in multiples of a disk sector
  (512 bytes).
  Larger files are stored using multiple FSBs, each of recordsize bytes,
  with a default value of 128K.
  
  dtrace output:
  
  Event      Device  RW  Block Size  Block No  Path
  sc-read            R      1052672         0  /mnt/bank0/media/CAD1/4.1
  fop_read           R      1052672         0  /mnt/bank0/media/CAD1/4.1
  disk_io    sd6     R        65536     50816  none
  disk_io    sd21    R        65536     50816  none
  disk_io    sd48    R        65536     50816  none
  disk_io    sd48    R       131072     47839  none
  disk_io    sd48    R        87552     48095  none
  disk_io    sd48    R        43520     48352  none
  disk_io    sd48    R        43520     48523  none
  disk_io    sd48    R        87552     48950  none
  disk_io    sd48    R        87552     49121  none
  disk_io    sd6     R       131072     48096  none
  disk_io    sd6     R        87552     48352  none
  disk_io    sd21    R        43520     48267  none
  disk_io    sd21    R        43520     48438  none
  disk_io    sd6     R        87552     48523  none
  disk_io    sd6     R        43520     48951  none
  disk_io    sd21    R        87552     49891  none
  disk_io    sd21    R        44032     50062  none
  disk_io    sd6     R        87552     49378  none
  disk_io    sd13    [remainder truncated]

Re: [zfs-discuss] JBOD performance

2007-12-18 Thread Roch - PAE

Frank Penczek writes:
  Hi,
  
  On Dec 17, 2007 4:18 PM, Roch - PAE [EMAIL PROTECTED] wrote:

 The pool holds home directories so small sequential writes to one
 large file present one of a few interesting use cases.
  
   Can you be more specific here ?
  
   Do you have a body of applications that would do small
   sequential writes, or one in particular? Another
   interesting piece of info is whether we expect those to be allocating
   writes or overwrites (beware that some apps move the old file
   out, then run allocating writes, then unlink the original
   file).
  
  Sorry, I try to be more specific.
  The zpool contains home directories that are exported to client machines.
  It is hard to predict what exactly users are doing, but one thing users do 
  for
  certain is checking out software projects from our subversion server. The
  projects typically contain many source code files (thousands) and a
  build process
  accesses all of them in the worst case. That is what I meant by many (small)
  files like compiling projects in my previous post. The performance
  for this case
  is ... hopefully improvable.
  

This we'll have to work on. But first, if this is to
storage with NVRAM, I assume you checked that the storage
does not flush its caches :


http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#Cache_Flushes


If that is not your problem, and if ZFS underperforms another
FS on the backend of NFS, then this needs investigation.

If ZFS/NFS underperforms a directly attached FS, that might
just be an NFS issue not related to ZFS. Again, that needs
investigation.

Performance gains won't happen unless we find out what
doesn't work.

  Now for sequential writes:
  We don't have a specific application issuing sequential writes but I
  can think of
  at least a few cases where these writes may occur, e.g.
  dumps of substantial amounts of measurement data or growing log files
  of applications.
  In either case these would be mainly allocating writes.
  

Right, but I'd hope the application would issue substantially
large writes, especially if it needs to dump data at a high rate.
If the data rate is more modest, then the CPU lost to this
effect will itself be modest.

  Does this provide the information you're interested in?
  

I get a sense that it's more important we find out what
your build issue is. But the small writes will have to be
improved one day also.


-r

  
  Cheers,
Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] JBOD performance

2007-12-17 Thread Roch - PAE


dd uses a default block size of 512B. Does this map to your
expected usage? When I quickly tested the CPU cost of small
reads from cache, I did see that ZFS was more costly than UFS
up to a crossover between 8K and 16K. We might need a more
comprehensive study of that (data in/out of cache, different
recordsize and alignment constraints). But for small
syscalls, I think we might need some work in ZFS to make it
CPU efficient.

So first, does small sequential write to a large file
match an interesting use case?


-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] JBOD performance

2007-12-17 Thread Roch - PAE
Frank Penczek writes:
  Hi,
  
  On Dec 17, 2007 10:37 AM, Roch - PAE [EMAIL PROTECTED] wrote:
  
  
   dd uses a default block size of 512B.  Does this map to your
   expected usage ? When I quickly tested the CPU cost of small
   read from cache, I did see that ZFS was more costly than UFS
   up to a crossover between 8K and 16K.   We might need a more
   comprehensive study of that (data in/out of cache, different
   recordsize  alignment constraints   ).   But  for small
   syscalls, I think we might need some work  in ZFS to make it
   CPU efficient.
  
   So first,  does  small sequential writeto a large  file,
   matches an interesting use case ?
  
  The pool holds home directories so small sequential writes to one
  large file present one of a few interesting use cases.

Can you be more specific here?

Do you have a body of applications that would do small
sequential writes, or one in particular? Another
interesting piece of info is whether we expect those to be allocating
writes or overwrites (beware that some apps move the old file
out, then run allocating writes, then unlink the original
file).



  The performance is equally disappointing for many (small) files
  like compiling projects in svn repositories.
  

???

-r


  Cheers,
Frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Odd prioritisation issues.

2007-12-12 Thread Roch - PAE
Dickon Hood writes:
  On Fri, Dec 07, 2007 at 13:14:56 +, I wrote:
  : On Fri, Dec 07, 2007 at 12:58:17 +, Darren J Moffat wrote:
  : : Dickon Hood wrote:
  : : On Fri, Dec 07, 2007 at 12:38:11 +, Darren J Moffat wrote:
  : : : Dickon Hood wrote:
  
  : : : We're seeing the writes stall in favour of the reads.  For normal
  : : : workloads I can understand the reasons, but I was under the 
  impression
  : : : that real-time processes essentially trump all others, and I'm 
  surprised
  : : : by this behaviour; I had a dozen or so RT-processes sat waiting for 
  disc
  : : : for about 20s.
  
  : : : Are the files opened with O_DSYNC or does the application call fsync ?
  
  : : No.  O_WRONLY|O_CREAT|O_LARGEFILE|O_APPEND.  Would that help?
  
  : : Don't know if it will help, but it will be different :-).  I suspected 
  : : that since you put the processes in the RT class you would also be doing 
  : : synchronous writes.
  
  : Right.  I'll let you know on Monday; I'll need to restart it in the
  : morning.
  
  I was a tad busy yesterday and didn't have the time, but I've switched one
  of our recorder processes (the one doing the HD stream; ~17Mb/s,
  broadcasting a preview we don't mind trashing) to a version of the code
  which opens its file O_DSYNC as suggested.
  
  We've gone from ~130 write ops per second and 10MB/s to ~450 write ops per
  second and 27MB/s, with a marginally higher CPU usage.  This is roughly
  what I'd expect.
  
  We've artifically throttled the reads, which has helped (but not fixed; it
  isn't as determinative as we'd like) the starvation problem at the expense
  of increasing a latency we'd rather have as close to zero as possible.
  
  Any ideas?
  

O_DSYNC was a good idea. Then, if you have a recent Nevada, you
can use the separate intent log (log keyword in zpool
create) to absorb those writes without having spindle
competition with the reads. Your write workload should then
be well handled here (unless the incoming network processing
is itself delayed).
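
For reference, a sketch of setting one up (device names hypothetical):

    # zpool create tank mirror c0t0d0 c0t1d0 log c2t0d0

or, to add a log device to an existing pool on a recent build:

    # zpool add tank log c2t0d0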


-r


  Thanks.
  
  -- 
  Dickon Hood
  
  Due to digital rights management, my .sig is temporarily unavailable.
  Normal service will be resumed as soon as possible.  We apologise for the
  inconvenience in the meantime.
  
  No virus was found in this outgoing message as I didn't bother looking.
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Lack of physical memory evidences

2007-11-26 Thread Roch - PAE

Dmitry Degrave writes:
  In pre-ZFS era, we had observable parameters like scan rate and
  anonymous page-in/-out counters to discover situations when a system
  experiences a lack of physical memory. With ZFS, it's difficult to use
  mentioned parameters to figure out situations like that. Has someone
  any idea what we can use for the same purpose now ? 
  

Those should still work.

What prevents them from being effective markers today is
that, no matter how much memory you have, a write-heavy
workload (one that dirties data faster than the disks can drain it)
will consume whatever you have and trigger the markers.

If you don't have the above problem, then anonymous paging
is a good sign of memory shortage.
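
Both markers are still visible with the standard tools, for instance
(intervals arbitrary):

    # vmstat 5        (watch the sr column for the scan rate)
    # vmstat -p 5     (watch api/apo/apf for anonymous page-in/out/free)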

-r


  Thanks in advance,
  Dmeetry
   
   
  This message posted from opensolaris.org
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-21 Thread Roch - PAE
Moore, Joe writes:
  Louwtjie Burger wrote:
   Richard Elling wrote:
   
- COW probably makes that conflict worse


   
This needs to be proven with a reproducible, real-world 
   workload before it
makes sense to try to solve it.  After all, if we cannot 
   measure where
we are,
how can we prove that we've improved?
   
   I agree, let's first find a reproducible example where updates
   negatively impacts large table scans ... one that is rather simple (if
   there is one) to reproduce and then work from there.
  
  I'd say it would be possible to define a reproducible workload that
  demonstrates this using the Filebench tool... I haven't worked with it
  much (maybe over the holidays I'll be able to do this), but I think a
  workload like:
  
  1) create a large file (bigger than main memory) on an empty ZFS pool.
  2) time a sequential scan of the file
  3) random write i/o over say, 50% of the file (either with or without
  matching blocksize)
  4) time a sequential scan of the file
  
  The difference between times 2 and 4 are the penalty that COW block
  reordering (which may introduce seemingly-random seeks between
  sequential blocks) imposes on the system.
  

But it's not the only thing. The difference between 2 and 4
is the COW penalty, which one can hide behind prefetching and
many spindles.

The other thing is to see what the impact (throughput and
response time) of the file scan operation is on the ongoing
random write load.

Third is the impact on CPU cycles required to do the file scans.
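
A rough bash sketch of the workload described above, using only dd
(pool, file and sizes are hypothetical, and filebench would do the
random-write phase far more efficiently):

    #!/usr/bin/bash
    F=/test/big                                   # file on an empty pool
    # 1) create a file larger than main memory (here 128 GB of zeros)
    dd if=/dev/zero of=$F bs=128k count=1048576
    # 2) time a sequential scan
    ptime dd if=$F of=/dev/null bs=128k
    # 3) rewrite ~50% of the 128K blocks at random offsets
    i=0
    while [ $i -lt 524288 ]; do
        blk=$(( (RANDOM * 32768 + RANDOM) % 1048576 ))
        dd if=/dev/zero of=$F bs=128k oseek=$blk count=1 \
            conv=notrunc >/dev/null 2>&1
        i=$(( i + 1 ))
    done
    # 4) time the sequential scan again and compare with step 2
    ptime dd if=$F of=/dev/null bs=128k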

-r

  It would be interesting to watch seeksize.d's output during this run
  too.
  
  --Joe
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] slog tests on read throughput exhaustion (NFS)

2007-11-19 Thread Roch - PAE
Neil Perrin writes:
  
  
  Joe Little wrote:
   On Nov 16, 2007 9:13 PM, Neil Perrin [EMAIL PROTECTED] wrote:
   Joe,
  
   I don't think adding a slog helped in this case. In fact I
   believe it made performance worse. Previously the ZIL would be
   spread out over all devices but now all synchronous traffic
   is directed at one device (and everything is synchronous in NFS).
   Mind you 15MB/s seems a bit on the slow side - especially is
   cache flushing is disabled.
  
   It would be interesting to see what all the threads are waiting
   on. I think the problem maybe that everything is backed
   up waiting to start a transaction because the txg train is
   slow due to NFS requiring the ZIL to push everything synchronously.
  
   
   I agree completely. The log (even though slow) was an attempt to
   isolate writes away from the pool. I guess the question is how to
   provide for async access for NFS. We may have 16, 32 or whatever
   threads, but if a single writer keeps the ZIL pegged and prohibiting
   reads, its all for nought. Is there anyway to tune/configure the
   ZFS/NFS combination to balance reads/writes to not starve one for the
   other. Its either feast or famine or so tests have shown.
  
  No there's no way currently to give reads preference over writes.
  All transactions get equal priority to enter a transaction group.
  Three txgs can be outstanding as we use a 3 phase commit model:
  open; quiescing; and syncing.

That makes me wonder if this is not just the lack-of-write-throttling
issue. If one txg is syncing and the other is
quiesced out, I think it means we have let in too many
writes. We do need a better balance.

Neil, is it correct that reads never hit txg_wait_open(), but
just need an I/O scheduler slot?

If so, it seems to me just a matter of

6429205 each zpool needs to monitor it's throughput and throttle heavy 
writers

However, if this is it, disabling the ZIL would not solve the
issue (it might even make it worse). So I am lost as to
what could be blocking the reads, other than a lack of I/O
slots. As another way to improve the I/O scheduler we have :


6471212 need reserved I/O scheduler slots to improve I/O latency of 
critical ops



-r

  
  Neil.
  
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + fragments

2007-11-13 Thread Roch - PAE

Louwtjie Burger writes:
  Hi
  
  After a clean database load a database would (should?) look like this,
  if a random stab at the data is taken...
  
  [8KB-m][8KB-n][8KB-o][8KB-p]...
  
  The data should be fairly (100%) sequential in layout ... after some
  days though that same spot (using ZFS) would problably look like:
  
  [8KB-m][   ][8KB-o][   ]
  
  Is this pseudo logical-physical view correct (if blocks n and p were
  updated and, with COW, relocated somewhere else)?
  

That's the proper view if the ZFS recordsize is tuned to be 8KB.
That's a best practice that might need to be qualified in
the future.
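
For example, a sketch of that practice for an 8 KB database (dataset
name hypothetical; the setting only affects files written afterwards):

    # zfs set recordsize=8k pool/oradata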


  Could a utility be constructed to show the level of fragmentation ?
  (50% in above example)
  

That will need to dive into the internals of ZFS. But
anything is possible.  It's been done for UFS before.


  IF the above theory is flawed... how would fragmentation look/be
  observed/calculated under ZFS with large Oracle tablespaces?
  
  Does it even matter what the fragmentation is from a performance 
  perspective?

It matters to table scans and how those scans will impact OLTP
workloads. Good blog topic. Stay tuned.


  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS + DB + default blocksize

2007-11-12 Thread Roch - PAE
Louwtjie Burger writes:
  Hi
  
  What is the impact of not aligning the DB blocksize (16K) with ZFS,
  especially when it comes to random reads on single HW RAID LUN.
  
  How would one go about measuring the impact (if any) on the workload?
  

The DB will have a bigger in-memory footprint, as you
will need to keep the ZFS record around for the lifespan of the DB
block.

This probably means you want to partition memory between the
DB cache and the ZFS ARC according to the ratio of DB blocksize to ZFS recordsize.

Then I imagine you have multiple spindles associated with
the LUN. If your LUN is capable of 2000 IOPS over a
200MB/sec data channel, then during 1 second at full speed :

2000 IOPS * 16K = 32MB of data transfer,

and this fits in the channel capability.
But using, say, ZFS blocks of 128K:

2000 IOPS * 128K = 256MB,

which overloads the channel. So in this example the data
channel would saturate first, preventing you from reaching
those 2000 IOPS. But with enough memory and data channel
throughput, it's a good idea to keep the ZFS recordsize
large.


-r


  Thank you
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Unreasonably high sys utilization during file create operations.

2007-11-07 Thread Roch - PAE

Was that with compression enabled ?

Got zpool status output ?

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] [Fwd: [Fwd: MySQL benchmark]]

2007-11-05 Thread Roch - PAE


   Original Message 
  Subject:  [zfs-discuss] MySQL benchmark
  Date: Tue, 30 Oct 2007 00:32:43 +
  From: Robert Milkowski [EMAIL PROTECTED]
  Reply-To: Robert Milkowski [EMAIL PROTECTED]
  Organization: CI TASK http://www.task.gda.pl
  To:   zfs-discuss@opensolaris.org



  Hello zfs-discuss,

http://dev.mysql.com/tech-resources/articles/mysql-zfs.html


I've just quickly glanced thru it.
However the argument about double buffering problem is not valid.


  -- 
  Best regards,
   Robert Milkowski  mailto:[EMAIL PROTECTED]
   http://milek.blogspot.com

  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

  --- end of forwarded message ---



I absolutely agree with Robert here. Data is cached once in the
database and, absent directio, _some_ extra memory is required
to stage the I/Os. On reads it's a tiny amount, since the memory
can be reclaimed as soon as it's copied to user space.

10000 threads each waiting for 8K will be serviced using 80M
of extra memory.

On the write path we need to stage the data for the purpose
of a ZFS transaction group. When the dust settles we will be
able to do this every 5 seconds. So what percentage of
DB blocks are modified in 5 to 10 seconds?

If the answer is 5% then yes, the lack of directio is a
concern for you.


-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZIL reliability/replication questions

2007-10-24 Thread Roch - PAE

  
   This should work. It shouldn't even lose the in-flight transactions.
   ZFS reverts to using the main pool if a slog write fails or the
   slog fills up.

  So, the only way to lose transactions would be a crash or power loss,
  leaving outstanding transactions in the log, followed by the log
  device failing to start up on reboot?  I assume that that would that
  be handled relatively cleanly (files have out of data data), as
  opposed to something nasty like the pool fails to start up.


It's just data loss from the zpool perspective. However, it's
data loss of application-committed data. So applications
that relied on committing data for their own consistency might
end up with a corrupted view of the world. NFS clients fall
in this bin.

Mirroring the NVRAM cards in the separate intent log seems
like a very good idea.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Sequential reading/writing from large stripe faster on SVM than ZFS?

2007-10-24 Thread Roch - PAE

I would suspect the checksum part of this (I do believe it's being
actively worked on) :

6533726 single-threaded checksum & raidz2 parity calculations limit 
write bandwidth on thumper

-r

Robert Milkowski writes:
  Hi,
  
  snv_74, x4500, 48x 500GB, 16GB RAM, 2x dual core
  
  # zpool create test c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 c0t5d0 c0t6d0 c0t7d0 
  c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 c4t0d0 c4t1d0 c4t2d0 
  c4t3d0 c4t4d0 c4t5d0 c4t6d0 c4t7d0 c5t1d0 c5t2d0 c5t3d0 c5t5d0 c5t6d0 c5t7d0 
  c6t0d0 c6t1d0 c6t2d0 c6t3d0 c6t4d0 c6t5d0 c6t6d0 c6t7d0 c7t0d0 c7t1d0 c7t2d0 
  c7t3d0 c7t4d0 c7t5d0 c7t6d0 c7t7d0
  [46x 500GB]
  
  # ls -lh /test/q1
  -rw-r--r--   1 root root 82G Oct 18 09:43 /test/q1
  
  # dd if=/test/q1 of=/dev/null bs=16384k 
  # zpool iostat 1
         capacity     operations    bandwidth
  pool        used  avail   read  write   read  write
  ----------  -----  -----  -----  -----  -----  -----
  test        213G  20.6T    645    120  80.1M  14.7M
  test 213G  20.6T  9.26K  0  1.16G  0
  test 213G  20.6T  9.66K  0  1.21G  0
  test 213G  20.6T  9.41K  0  1.18G  0
  test 213G  20.6T  9.41K  0  1.18G  0
  test 213G  20.6T  7.45K  0   953M  0
  test 213G  20.6T  7.59K  0   971M  0
  test 213G  20.6T  7.41K  0   948M  0
  test 213G  20.6T  8.25K  0  1.03G  0
  test 213G  20.6T  9.17K  0  1.15G  0
  test 213G  20.6T  9.54K  0  1.19G  0
  test 213G  20.6T  9.89K  0  1.24G  0
  test 213G  20.6T  9.41K  0  1.18G  0
  test 213G  20.6T  9.31K  0  1.16G  0
  test 213G  20.6T  9.80K  0  1.22G  0
  test 213G  20.6T  8.72K  0  1.09G  0
  test 213G  20.6T  7.86K  0  1006M  0
  test 213G  20.6T  7.21K  0   923M  0
  test 213G  20.6T  7.62K  0   975M  0
  test 213G  20.6T  8.68K  0  1.08G  0
  test 213G  20.6T  9.81K  0  1.23G  0
  test 213G  20.6T  9.57K  0  1.20G  0
  
  So it's around 1GB/s.
  
  # dd if=/dev/zero of=/test/q10 bs=128k 
  # zpool iostat 1
         capacity     operations    bandwidth
  pool        used  avail   read  write   read  write
  ----------  -----  -----  -----  -----  -----  -----
  test        223G  20.6T    656    170  81.5M  20.8M
  test 223G  20.6T  0  8.10K  0  1021M
  test 223G  20.6T  0  7.94K  0  1001M
  test 216G  20.6T  0  6.53K  0   812M
  test 216G  20.6T  0  7.19K  0   906M
  test 216G  20.6T  0  6.78K  0   854M
  test 216G  20.6T  0  7.88K  0   993M
  test 216G  20.6T  0  10.3K  0  1.27G
  test 222G  20.6T  0  8.61K  0  1.04G
  test 222G  20.6T  0  7.30K  0   919M
  test 222G  20.6T  0  8.16K  0  1.00G
  test 222G  20.6T  0  8.82K  0  1.09G
  test 225G  20.6T  0  4.19K  0   511M
  test 225G  20.6T  0  10.2K  0  1.26G
  test 225G  20.6T  0  9.15K  0  1.13G
  test 225G  20.6T  0  8.46K  0  1.04G
  test 225G  20.6T  0  8.48K  0  1.04G
  test 225G  20.6T  0  10.9K  0  1.33G
  test 231G  20.6T  0  3  0  3.96K
  test 231G  20.6T  0  0  0  0
  test 231G  20.6T  0  0  0  0
  test 231G  20.6T  0  9.02K  0  1.11G
  test 231G  20.6T  0  12.2K  0  1.50G
  test 231G  20.6T  0  9.14K  0  1.13G
  test 231G  20.6T  0  10.3K  0  1.27G
  test 231G  20.6T  0  9.08K  0  1.10G
  test 237G  20.6T  0  0  0  0
  test 237G  20.6T  0  0  0  0
  test 237G  20.6T  0  6.03K  0   760M
  test 237G  20.6T  0  9.18K  0  1.13G
  test 237G  20.6T  0  8.40K  0  1.03G
  test 237G  20.6T  0  8.45K  0  1.04G
  test 237G  20.6T  0  11.1K  0  1.36G
  
  Well, writing could be faster than reading here... there're gaps due to bug 
  6415647 I guess.
  
  
  # zpool destroy test
  
  # metainit d100 1 46 c0t0d0s0 c0t1d0s0 c0t2d0s0 c0t3d0s0 c0t4d0s0 c0t5d0s0 
  c0t6d0s0 c0t7d0s0 c1t0d0s0 c1t1d0s0 c1t2d0s0 c1t3d0s0 c1t4d0s0 c1t5d0s0 
  c1t6d0s0 c1t7d0s0 c4t0d0s0 c4t1d0s0 c4t2d0s0 c4t3d0s0 c4t4d0s0 c4t5d0s0 
  c4t6d0s0 c4t7d0s0 c5t1d0s0 c5t2d0s0 c5t3d0s0 c5t5d0s0 c5t6d0s0 c5t7d0s0 
  c6t0d0s0 c6t1d0s0 c6t2d0s0 c6t3d0s0 c6t4d0s0 c6t5d0s0 c6t6d0s0 c6t7d0s0 
  c7t0d0s0 c7t1d0s0 c7t2d0s0 c7t3d0s0 c7t4d0s0 c7t5d0s0 c7t6d0s0 c7t7d0s0 -i 
  128k
  d100: Concat/Stripe is setup
  [46x 500GB]
  
  And I get not so good results - maximum 1GB of reading... h...
  
  maxphys is 56K - I thought 

Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-04 Thread Roch - PAE
Jim Mauro writes:
  
   Where does the win come from with directI/O?  Is it 1), 2), or some  
   combination?  If its a combination, what's the percentage of each  
   towards the win?
 
  That will vary based on workload (I know, you already knew that ... :^).
  Decomposing the performance win between what is gained as a result of 
  single writer
  lock breakup and no caching is something we can only guess at, because, 
  at least
  for UFS, you can't do just one - it's all or nothing.
   We need to tease 1) and 2) apart to have a full understanding.  
  
  We can't. We can only guess (for UFS).
  
  My opinion - it's a must-have for ZFS if we're going to get serious 
  attention
  in the database space. I'll bet dollars-to-donuts that, over the next 
  several years,
  we'll burn many tens-of-millions of dollars on customer support 
  escalations that
  come down to memory utilization issues and contention between database
  specific buffering and the ARC. This is entirely my opinion (not that of 
  Sun),


...memory utilisation... OK so we should implement the 'lost cause' rfe.

In all cases, ZFS must not steal pages from other memory consumers :

6488341 ZFS should avoiding growing the ARC into trouble

So the DB memory pages should not be _contended_ for. 

-r

  and I've been wrong before.
  
  Thanks,
  /jim
  
  
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-04 Thread Roch - PAE

eric kustarz writes:
  
   Anyhow, in the case of DBs, ARC indeed becomes a vestigial organ. I'm
   surprised that this is being met with skepticism considering that
   Oracle highly recommends direct IO be used,  and, IIRC, Oracle
   performance was the main motivation to adding DIO to UFS back in
   Solaris 2.6. This isn't a problem with ZFS or any specific fs per se,
   it's the buffer caching they all employ. So I'm a big fan of seeing
   6429855 come to fruition.
  
  The point is that directI/O typically means two things:
  1) concurrent I/O
  2) no caching at the file system
  

In my blog I also mention :

   3) no readahead (but can be viewed as an implicit consequence of 2)

And someone chimed in with

   4) ability to do I/O at the sector granularity.


I also think that for many, 2) is too weak a form of what they
expect :

   5) DMA straight from user buffer to disk avoiding a copy.


So
 
   1) concurrent I/O we have in ZFS.

   2) No Caching.
  we could do by taking a directio hint and evicting the
  ARC buffer immediately after copyout to user space
  for reads, and after txg completion for writes.

   3) No prefetching.
  we have 2 levels of prefetching. The low level was
  fixed recently; it should not cause problems for DB loads.
  The high level still needs fixing on its own.
  Then we should take the same hint as 2) to disable it
  altogether. In the meantime we can tune our way into
  this mode.

   4) Sector sized I/O
  Is really foreign to ZFS design.

   5) Zero Copy and more CPU efficiency.
  I think is where the debate is.



My line has been that 5) won't help latency much, and latency is
where I think the game is currently played. Now the
disconnect might be that people feel the game
is not latency but CPU efficiency : how many CPU cycles do I
burn to get data from disk to the user buffer ? This is a
valid point. Configurations with a very large number
of disks can end up saturated by the filesystem's CPU utilisation.

So I still think that the major areas for ZFS perf gains are
on the latency front : block allocation (now much improved
with the separate intent log), I/O scheduling, and other
fixes to the threading and ARC behavior. But at some point we
can turn our microscope on the CPU efficiency of the
implementation. The copy will certainly be a big chunk of
the CPU cost per I/O, but I would still like to gather that
data.

Also consider: 50 disks at 200 IOPS of 8K is 80 MB/sec.
That means maybe 1/10th of a single CPU to be saved by
avoiding just the copy. Probably not what people have in
mind. How many CPUs do you have when attaching 1000 drives
to a host running a 100TB database ? That many drives will barely
occupy 2 cores running the copies.

People want performance and efficiency. Directio is
just an overloaded name that delivered those gains to other
filesystems.

Right now, what I think is worth gathering is the cycles spent
in ZFS per read and write in a large DB environment where the DB
holds 90% of memory. For comparison with another FS, we
should disable checksum, file prefetching and vdev prefetching,
cap the ARC, turn atime off, and use an 8K recordsize. A breakdown
and comparison of the CPU cost per layer will be quite
interesting and will point to what needs work.
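
(As a rough sketch of how one could gather that per-read/write CPU
cost on a test box -- assuming the fbt:zfs:zfs_read/zfs_write probes
exist on the build at hand, which is not guaranteed -- a DTrace
one-liner along these lines aggregates on-CPU nanoseconds per call:)

    # dtrace -n '
      fbt:zfs:zfs_read:entry, fbt:zfs:zfs_write:entry
            { self->ts = vtimestamp; }
      fbt:zfs:zfs_read:return, fbt:zfs:zfs_write:return
      /self->ts/
            { @oncpu[probefunc] = quantize(vtimestamp - self->ts);
              self->ts = 0; }'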

Another interesting thing for me would be : what is your
budget ?

how many cycles per DB read and write are you
willing to spend, and how did you come to that number ?


But, as Eric says, let's develop 2) and I'll try in parallel to
figure out the per-layer breakdown cost.

-r



  Most file systems (ufs, vxfs, etc.) don't do 1) or 2) without turning  
  on directI/O.
  
  ZFS *does* 1.  It doesn't do 2 (currently).
  
  That is what we're trying to discuss here.
  
  Where does the win come from with directI/O?  Is it 1), 2), or some  
  combination?  If its a combination, what's the percentage of each  
  towards the win?
  
  We need to tease 1) and 2) apart to have a full understanding.  I'm  
  not against adding 2) to ZFS but want more information.  I suppose  
  i'll just prototype it and find out for myself.
  
  eric

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-04 Thread Roch - PAE

Nicolas Williams writes:

  On Thu, Oct 04, 2007 at 03:49:12PM +0200, Roch - PAE wrote:
   ...memory utilisation... OK so we should implement the 'lost cause' rfe.
   
   In all cases, ZFS must not steal pages from other memory consumers :
   
  6488341 ZFS should avoiding growing the ARC into trouble
   
   So the DB memory pages should not be _contented_ for. 
  
  What if your executable text, and pretty much everything lives on ZFS?
  You don't want to content for the memory caching those things either.
  It's not just the DB's memory you don't want to contend for.

On the read side, 

We're talking here about 1000 disks, each running 35
concurrent I/Os of 8K, so a footprint of 250MB, to stage a
ton of work.

On the write side  we do have  to play with  the transaction
group  so  that will be  5-10   seconds worth of synchronous
write activity.

But how much memory does a 1000-disk server have ?

-r




___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-04 Thread Roch - PAE

Nicolas Williams writes:
  On Wed, Oct 03, 2007 at 04:31:01PM +0200, Roch - PAE wrote:
 It does, which leads to the core problem. Why do we have to store the
 exact same data twice in memory (i.e., once in the ARC, and once in
 the shared memory segment that Oracle uses)? 
   
   We do not retain 2 copies of the same data.
   
   If the DB cache is made large enough to consume most of memory,
   the ZFS copy will quickly be evicted to stage other I/Os on
   their way to the DB cache.
   
   What problem does that pose ?
  
  Other things deserving of staying in the cache get pushed out by things
  that don't deserve being in the cache.  Thus systemic memory pressure
  (e.g., more on-demand paging of text).
  
  Nico
  -- 

I agree. That's why I submitted both of these.

6429855 Need way to tell ZFS that caching is a lost cause
6488341 ZFS should avoiding growing the ARC into trouble

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Roch - PAE
Rayson Ho writes:

  1) Modern DBMSs cache database pages in their own buffer pool because
  it is less expensive than to access data from the OS. (IIRC, MySQL's
  MyISAM is the only one that relies on the FS cache, but a lot of MySQL
  sites use INNODB which has its own buffer pool)
  

The DB can and should cache data whether or not directio is used.

  2) Also, direct I/O is faster because it avoid double buffering.
  

A piece of data can be in one buffer, 2 buffers, 3
buffers. That says nothing about performance. More below.

So I guess  you  mean DIO  is  faster because it  avoids the
extra copy: dma straight to  user buffer rather than DMA  to
kernel buffer then copy to user buffer. If an  I/O is 5ms an
8K copy is  about 10 usec. Is  avoiding the copy  really the
most urgent thing to work on ?
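
(A very rough way to put numbers on that, purely for illustration:
time a stream of 8K in-memory copies and compare with the ~5ms of a
random disk I/O. This is only a ballpark, not a proper copyout
measurement:)

    # ptime dd if=/dev/zero of=/dev/null bs=8k count=100000

If the per-record CPU cost really is on the order of 10 usec, those
100000 records cost about a second of CPU, while the same number of
random disk I/Os would take on the order of 8 minutes of disk time.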



  Rayson
  
  
  
  
  On 10/2/07, eric kustarz [EMAIL PROTECTED] wrote:
   Not yet, see:
   6429855 Need way to tell ZFS that caching is a lost cause
  
   Is there a specific reason why you need to do the caching at the DB
   level instead of the file system?  I'm really curious as i've got
   conflicting data on why people do this.  If i get more data on real
   reasons on why we shouldn't cache at the file system, then this could
   get bumped up in my priority queue.
  

I can't answer this, although I can well imagine that the DB is
the most efficient place to cache its own data, all organised
and formatted to respond to queries.

But once the DB has signified to the FS that it doesn't
require the FS to cache data, then the benefit from this RFE
is that the memory used to stage the data can be quickly
recycled by ZFS for subsequent operations. It means the ZFS
memory footprint is more likely to contain useful ZFS
metadata and not cached data blocks we know are not likely to
be used again anytime soon.

We would also operate better in mixed DIO/non-DIO workloads.


See also:
http://blogs.sun.com/roch/entry/zfs_and_directio

-r



   eric

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Direct I/O ability with zfs?

2007-10-03 Thread Roch - PAE

Matty writes:
  On 10/3/07, Roch - PAE [EMAIL PROTECTED] wrote:
   Rayson Ho writes:
  
 1) Modern DBMSs cache database pages in their own buffer pool because
 it is less expensive than to access data from the OS. (IIRC, MySQL's
 MyISAM is the only one that relies on the FS cache, but a lot of MySQL
 sites use INNODB which has its own buffer pool)

  
   The DB can and should cache data whether or not directio is used.
  
  It does, which leads to the core problem. Why do we have to store the
  exact same data twice in memory (i.e., once in the ARC, and once in
  the shared memory segment that Oracle uses)? 

We do not retain 2 copies of the same data.

If the DB cache is made large enough to consume most of memory,
the ZFS copy will quickly be evicted to stage other I/Os on
their way to the DB cache.

What problem does that pose ?

-r

  
  Thanks,
  - Ryan
  -- 
  UNIX Administrator
  http://prefetch.net

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS array NVRAM cache?

2007-09-26 Thread Roch - PAE

Vincent Fox writes:
  I don't understand.  How do you
  
  setup one LUN that has all of the NVRAM on the array dedicated to it
  
  I'm pretty familiar with 3510 and 3310. Forgive me for being a bit
  thick here, but can you be more specific for the n00b?
  
  Do you mean from firmware side or OS side?  Or since the LUNs used
  for the ZIL are separated out from the other disks in the pool they DO
  get to make use of the NVRAM, is that it? 
  
  I have a pair of 3310 with 12 36-gig disks for testing.  I have a
  V240 with PCI dual-SCSI controller so I can drive one array from each
  port is what I am tinkering with right now.  Looking for maximum
  reliability/redundancy of course so I would ZFS mirror the arrays.
  
  Can you suggest a setup here?  A single-disk from each array
  exported as a LUN, then ZFS mirrored together for the ZIL log?
  An example would be helpful.  Could I then just lump all the
  remaining disks into a 10-disk RAID-5 LUN, mirror them together
  and achieve a significant performance improvement?  Still have
  to have a global spare of course in the HW RAID.   What about
  sparing for the ZIL?
   

With 

PSARC 2007/171 ZFS Separate Intent Log

now in Nevada, you can set up the ZIL on its own set of
(possibly very fast) LUNs. The LUNs can be mirrored if you
have more than one NVRAM card.

http://blogs.sun.com/perrin/entry/slog_blog_or_blogging_on

This will work great to accelerate JBOD using just a small
amount of NVRAM for the ZIL. When the storage is fronted 100%
by NVRAM, the benefits of the slog won't be as large.

Last week we had this putback :

PSARC 2007/053 Per-Disk-Device support of non-volatile cache
6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE 
to SBC-2 devices

which will prevent some recognised arrays from doing
unnecessary cache flushes and allow tuning others using
sd.conf. Otherwise arrays will be queried for support of the
SYNC_NV capability. IMO, the best approach is to bug storage
vendors into supporting SYNC_NV.

For earlier releases, to get the full benefit of the NVRAM on
ZIL operations you are stuck with a raw tuning proposition :

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide


http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#FLUSH
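
(For illustration only: on builds that have the zfs_nocacheflush
variable, the raw tuning referred to above is an /etc/system entry
along these lines. It is only sane when every vdev of every pool on
the host sits behind protected NVRAM, and it should be removed as
soon as SYNC_NV support makes it unnecessary:)

    * /etc/system (reboot required) -- NVRAM-backed arrays only
    set zfs:zfs_nocacheflush = 1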

-r

See also :

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide


   


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] io:::start and zfs filenames?

2007-09-26 Thread Roch - PAE

Neelakanth Nadgir writes:
  io:::start probe does not seem to get zfs filenames in
  args[2]-fi_pathname. Any ideas how to get this info?
  -neel
  

Who says an I/O is doing work for a single pathname/vnode
or for a single process ? There is no longer that one-to-one
correspondence. Not in the ZIL, and not in the
transaction groups, due to I/O aggregation.

As for mmapped I/O, follow Jim's advice; I guess fsflush will
be issuing some putpage :

  fsinfo   genunix   fop_putpage putpage
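
(A sketch of that, assuming the fsinfo provider is available on your
build and exposes fi_pathname the same way the io provider does:)

    # dtrace -n 'fsinfo:genunix:fop_putpage:putpage
          { @[execname, args[0]->fi_pathname] = count(); }'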

-r


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS ARC DNLC Limitation

2007-09-25 Thread Roch - PAE

Hi Jason, This should have helped.

6542676 ARC needs to track meta-data memory overhead

Some of the lines added to arc.c:

1551  1.36      if (arc_meta_used >= arc_meta_limit) {
1552                    /*
1553                     * We are exceeding our meta-data cache limit.
1554                     * Purge some DNLC entries to release holds on meta-data.
1555                     */
1556                    dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
1557            }
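
(For illustration, on builds that carry the fix: the meta-data
accounting can be watched through the arcstats kstat or the ::arc
dcmd, and the overall ARC can still be capped from /etc/system.
Field names and the tunable depend on the build, and the 4 GB value
below is just an example:)

    # kstat -p zfs:0:arcstats | egrep 'meta|size|c_max'
    # echo ::arc | mdb -k

    * /etc/system (reboot required)
    set zfs:zfs_arc_max = 0x100000000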

-r



Jason J. W. Williams writes:
  Hello All,
  
  Awhile back (Feb '07) when we noticed ZFS was hogging all the memory
  on the system, y'all were kind enough to help us use the arc_max
  tunable to attempt to limit that usage to a hard value. Unfortunately,
  at the time a sticky problem was that the hard limit did not include
  DNLC entries generated by ZFS.
  
  I've been watching the list since then and trying to watch the Nevada
  commits. I haven't noticed that anything has been committed back so
  that arc_max truly enforces the max amount of memory ZFS is allowed to
  consume (including DNLC entries). Has this been corrected and I just
  missed it? Thank you in advance for you any help.
  
  Best Regards,
  Jason

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS (and quota)

2007-09-24 Thread Roch - PAE

Pawel Jakub Dawidek writes:
  I'm CCing zfs-discuss@opensolaris.org, as this doesn't look like
  FreeBSD-specific problem.
  
  It looks there is a problem with block allocation(?) when we are near
  quota limit. tank/foo dataset has quota set to 10m:
  
  Without quota:
  
   FreeBSD:
   # dd if=/dev/zero of=/tank/test bs=512 count=20480
   time: 0.7s
  
   Solaris:
   # dd if=/dev/zero of=/tank/test bs=512 count=20480
   time: 4.5s
  
  With quota:
  
   FreeBSD:
   # dd if=/dev/zero of=/tank/foo/test bs=512 count=20480
   dd: /tank/foo/test: Disc quota exceeded
   time: 306.5s
  
   Solaris:
   # dd if=/dev/zero of=/tank/foo/test bs=512 count=20480
   write: Disc quota exceeded
   time: 602.7s
  
  CPU is almost entirely idle, but disk activity seems to be high.
  


Yes, as we get near the quota limit, each transaction group
will accept only a small amount of data so as not to overshoot the limit.

I don't know if we have the optimal strategy yet.

-r

  Any ideas?
  
  -- 
  Pawel Jakub Dawidek   http://www.wheel.pl
  [EMAIL PROTECTED]   http://www.FreeBSD.org
  FreeBSD committer Am I Evil? Yes, I Am!

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs and small files

2007-09-21 Thread Roch - PAE

Claus Guttesen writes:
I have many small - mostly jpg - files where the original file is
approx. 1 MB and the thumbnail generated is approx. 4 KB. The files
are currently on vxfs. I have copied all files from one partition onto
a zfs-ditto. The vxfs-partition occupies 401 GB and zfs 449 GB. Most
files uploaded are in jpg and all thumbnails are always jpg.
  
   Is there a problem?
  
  Not by the diskusage itself. But if zfs takes up more space than vxfs
  (12 %) 80 TB will become 89 TB instead (our current storage) and add
  cost.
  
   Also, how are you measuring this (what commands)?
  
  I did a 'df -h'.
  
Will a different volblocksize (during creation of the partition) make
better use of the available diskspace? Will (meta)data require less
space if compression is enabled?
  
   volblocksize won't have any affect on file systems, it is for zvols.
   Perhaps you mean recordsize?  But recall that recordsize is the maximum 
   limit,
   not the actual limit, which is decided dynamically.
  
I read 
http://www.opensolaris.org/jive/thread.jspa?threadID=37673tstart=105
which is very similar to my case except for the file type. But no
clear pointers otherwise.
  
   A good start would be to find the distribution of file sizes.
  
  The files are approx. 1 MB with an thumbnail of approx. 4 KB.
  

So the 1 MB files are stored as ~8 x 128K recordsize.

Because of 
5003563 use smaller tail block for last block of object

The last block of your file is partially used. It will depend
on your filesize distribution, but without that info we can
only guess that we're wasting an avg of 64K per file, or 6%.

If your distribution is such that most files are slightly
more than 1M, then we'd have 12% overhead from this effect.

So using 16K/32K recordsize would quite possibly help as
files would be stored using ~ 64 x 16K blocks with an
overhead of  1-2% (0.5 blocks wasted  every 64).
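
(For example -- dataset names here are hypothetical, and recordsize
only affects blocks written after the property is set, so it has to
be done before the files are copied in:)

    # zfs create tank/images
    # zfs set recordsize=16K tank/images
    # zfs get recordsize tank/images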


-r


  -- 
  regards
  Claus
  
  When lenity and cruelty play for a kingdom,
  the gentlest gamester is the soonest winner.
  
  Shakespeare

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs and small files

2007-09-21 Thread Roch - PAE

Claus Guttesen writes:
   So the 1 MB files are stored as ~8 x 128K recordsize.
  
   Because of
   5003563 use smaller tail block for last block of object
  
   The last block of you file is partially used. It will depend
   on your filesize distribution by without that info we can
   only guess that we're wasting an avg of 64K per file. Or 6%.
  
   If your distribution is such that most files are slightly
   more than 1M, then we'd have 12% overhead from this effect.
  
   So using 16K/32K recordsize would quite possibly help as
   files would be stored using ~ 64 x 16K blocks with an
   overhead of  1-2% (0.5 blocks wasted  every 64).
  
  I will (re)create the partition and modify the recordsize. I was
  unwilling to do so when I read the man page which discourages
  modifying this setting unless a database was used.
  
  Does zfs use suballocation if a file does not use an entire
  recordsize? If not the thumbnails probably wastes most space. They are
  approx. 4 KB.
  

Files smaller than 'recordsize' are stored using a multiple
of the sector size. So small files should not factor in this 
equation.


  I'll be testing recordsizes from 1K and upwards. Actually 1K made zfs
  very slow but 2K seems fine. I'll report back when the entire
  partition has been copied. When I find the sweet spot I'll try to
  enable (default) compression.

Beware because  at 2K you  might be generating more indirect
blocks.  For 1MB files   the  gains from using  a recordsize
smaller than 16K start to be quite small.

-r

  
  Thank you.
  
  -- 
  regards
  Claus
  
  When lenity and cruelty play for a kingdom,
  the gentlest gamester is the soonest winner.
  
  Shakespeare

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] question about uberblock blkptr

2007-09-20 Thread Roch - PAE

[EMAIL PROTECTED] writes:
  Roch - PAE wrote:
   [EMAIL PROTECTED] writes:
 Jim Mauro wrote:
 
  Hey Max - Check out the on-disk specification document at
  http://opensolaris.org/os/community/zfs/docs/.
 
  Page 32 illustration shows the rootbp pointing to a dnode_phys_t
  object (the first member of a objset_phys_t data structure).
 
  The source code indicates ub_rootbp is a blkptr_t, which contains
  a 3 member array of dva_t 's called blk_dva (blk_dva[3]).
  Each dva_t is a 2 member array of 64-bit unsigned ints (dva_word[2]).
 
  So it looks like each blk_dva contains 3 128-bit DVA's
 
  You probably figured all this out alreadydid you try using
  a objset_phys_t to format the data?
 
  Thanks,
  /jim
 Ok.  I think I know what's wrong.  I think the information (most 
   likely, 
 a objset_phys_t) is compressed
 with lzjb compression.  Is there a way to turn this entirely off (not 
 just for file data, but for all meta data
 as well when a pool is created?  Or do I need to figure out how to hack 
 in the lzjb_decompress() function in
 my modified mdb?  (Also, I figured out that zdb is already doing the 
 left shift by 9 before dumping DVA values,
 for anyone following this...).
 
  
   Max, this might help (zfs_mdcomp_disable) :
   http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#METACOMP
 
  Hi Roch,
  That would help, except it does not seem to work.  I set 
  zfs_mdcomp_disable to 1 with mdb,
  deleted the pool, recreated the pool, and zdb - still shows the 
  rootbp in the uberblock_t
  to have the lzjb flag turned on.  So I then added the variable to 
  /etc/system, destroyed the pool,
  rebooted, recreated the pool, and still the same result.  Also, my mdb 
  shows the same thing
  for the uberblock_t rootbp blkptr data.   I am running Nevada build 55b.
  
  I shall update the build I am running soon, but in the meantime I'll 
  probably write a modified cmd_print() function for my
  (modified)  mdb to handle (at least) lzjb compressed metadata.  Also, I 
  think the ZFS Evil Tuning Guide should be
  modified.  It says this can be tuned for Solaris 10 11/06 and snv_52.  I 
  guess that means only those
  two releases.  snv_55b has the variable, but it doesn't have an effect 
  (at least on the uberblock_t
  rootbp meta-data).
  
  thanks for your help.
  
  max
  

My bad. The tunable only affects indirect dbufs (so I guess
only for large files). As you noted, other metadata is
compressed unconditionally (I guess from the use of
ZIO_COMPRESS_LZJB in dmu_objset_open_impl).

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] ZFS Evil Tuning Guide

2007-09-17 Thread Roch - PAE

Tuning should not be done in general and Best practices
should be followed.

So get very much acquainted with this first :

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide

Then if you must, this could soothe or sting : 

http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide

So drive carefully.

-r


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS Evil Tuning Guide

2007-09-17 Thread Roch - PAE

Pawel Jakub Dawidek writes:
  On Mon, Sep 17, 2007 at 03:40:05PM +0200, Roch - PAE wrote:
   
   Tuning should not be done in general and Best practices
   should be followed.
   
   So get very much acquainted with this first :
   
  http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
   
   Then if you must, this could soothe or sting : 
   
  http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide
   
   So drive carefully.
  
  If some LUNs exposed to ZFS are not protected by NVRAM, then this
  tuning can lead to data loss or application level corruption.  However
  the ZFS pool integrity itself is NOT compromised by this tuning.
  
  Are you sure? Once you turn off flushing cache, how can you tell that
  your disk didn't reorder writes and uberblock was updated before new
  blocks were written? Will ZFS go the the previous blocks when the newest
  uberblock points at corrupted data?
  

Good point. I'll fix this. I don't know if we look for an
alternate uberblock, but even if we did, I guess the 'out of
sync' can occur lower down the tree.


-r


  -- 
  Pawel Jakub Dawidek   http://www.wheel.pl
  [EMAIL PROTECTED]   http://www.FreeBSD.org
  FreeBSD committer Am I Evil? Yes, I Am!

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Supporting recordsizes larger than 128K?

2007-09-05 Thread Roch - PAE
Matty writes:
  Are there any plans to support record sizes larger than 128k? We use
  ZFS file systems for disk staging on our backup servers (compression
  is a nice feature here), and we typically configure the disk staging
  process to read and write large blocks (typically 1MB or so). This
  reduces the number of I/Os that take place to our storage arrays, and
  our testing has shown that we can push considerably more I/O with 1MB+
  block sizes.
  

So other FSes and raw devices clearly benefit from a larger
blocksize, but given the way ZFS schedules such I/Os, I don't
expect any more throughput from bigger blocks.

Maybe you're hitting something else that limits throughput ?
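
(One quick sanity check -- just a sketch -- is to watch what actually
reaches the devices during the staging load; dividing kr/s or kw/s by
r/s or w/s gives the average I/O size per device, which tells you
whether ZFS is already issuing large aggregated I/Os:)

    # iostat -xnz 1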

-r


  Thanks for any insight,
  - Ryan
  -- 
  UNIX Administrator
  http://prefetch.net

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is ZFS efficient for large collections of small files?

2007-08-22 Thread Roch - PAE

  Brandorr wrote:
   Is ZFS efficient at handling huge populations of tiny-to-small files -
   for example, 20 million TIFF images in a collection, each between 5
   and 500k in size?
  
   I am asking because I could have sworn that I read somewhere that it
   isn't, but I can't find the reference.
 
  If you're worried about the I/O throughput, you should avoid RAIDZ1/2 
  configurations. random read performance will be desastrous if you do; 

A raid-z group can do one random read per I/O latency. So
8 disks (each capable of 200 IOPS) in a zpool split into
2 raid-z groups should be able to serve 400 files per
second. If you need to serve more files, then you need more
disks or need to use mirroring. With mirroring, I'd expect
to serve 1600 files (8*200). This model only applies to
random reading, not to sequential access, nor to any type of
write load.
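
(To check that model against a live pool -- 'tank' is a placeholder
pool name -- compare the per-vdev read operations reported during the
random-read load with the number of disks times ~200:)

    # zpool iostat -v tank 1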

For  small file creation ZFS can   be extremely efficient in
that it can create more than 1  file per I/O. It should also
approach disk streaming performance for write loads.

  I've seen random reads ratios with less than 1 MB/s on a X4500 with 40 
  dedicated disks for data storage. 

It would be nice to see if the above model matches your
data. So if you have all 40 disks in a single raid-z group
(an anti-best-practice) I'd expect 200 files served per
second, and if the files were of 5K avg size then I'd expect
that 1MB/sec.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide,

  If you don't have to worry about disk 
  space, use mirrors;  

right on !

  I got my best results during my extensive X4500 
  benchmarking sessions, when I mirrored single slices instead of complete 
  disks (resulting in 40 2-way-mirrors on 40 physical discs, mirroring 
  c0t0d0s0-c0t1d0s1 and c0t1d0s0-c0t0d0s1, and so on). If you're worried 
  about disk space,  you should consider striping several instances of 
  RAIDZ1 arrays, each one consisting of three discs or slices. sequential 
  access will  go down the cliff,  but random reads will be boosted.

Writes should be good if not great, no matter what the
workload is. I'm interested in data that shows otherwise.

  You should also adjust the recordsize. 

For small files I certainly would not.
Small files are stored as a single record when they are
smaller than the recordsize. A single record is good in my
book. Not sure when one would want otherwise for small files.


  Try to measure the average I/O 
  transaction size. There's a good chance that your I/O performance will 
  be best if you set your recordsize to a smaller value. For instance, if 
  your average file size is 12 KB, try using 8K or even 4K recordsize, 
  stay away from 16K or higher.

Tuning the recordsize is currently only recommended for
databases (large files) with fixed-record access. Again, it's
interesting input if tuning the recordsize helped another
type of workload.

-r

  -- 

  Ralf Ramge
  Senior Solaris Administrator, SCNA, SCSA

  Tel. +49-721-91374-3963 
  [EMAIL PROTECTED] - http://web.de/

  11 Internet AG
  Brauerstraße 48
  76135 Karlsruhe

  Amtsgericht Montabaur HRB 6484

  Vorstand: Henning Ahlert, Ralph Dommermuth, Matthias Ehrlich, Andreas Gauger, 
Matthias Greve, Robert Hoffmann, Norbert Lang, Achim Weiss 
  Aufsichtsratsvorsitzender: Michael Scheeren


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Odp: Is ZFS efficient for large collections of small files?

2007-08-22 Thread Roch - PAE
£ukasz K writes:
   Is ZFS efficient at handling huge populations of tiny-to-small files -
   for example, 20 million TIFF images in a collection, each between 5
   and 500k in size?
   
   I am asking because I could have sworn that I read somewhere that it
   isn't, but I can't find the reference.
  
  It depends, what type of I/O you will do. If only reads, there is no 
  problem. Writting small files ( and removing ) will fragmentate pool
  and it will be a huge problem.
  You can set recordsize to 32k ( or 16k ) and it will help for some time.
  

Comparing recordsize of 16K with 128K.

Files in the range of [0,16K] : no difference.
Files in the range of [16K,128K]  : more efficient to use 128K
Files in the range of [128K,500K] : more efficient to use 16K

In the [16K,128K] range the actual filesize is rounded up to
a multiple of 16K with a 16K recordsize, and to the nearest 512B
boundary with a 128K recordsize. This will be fairly catastrophic for
files slightly above 16K (rounded up to 32K vs 16K+512B).

In the [128K, 500K] range we're hurt by this

5003563 use smaller tail block for last block of object

until it is fixed. Then yes, files stored using 16K
records are rounded up more tightly; metadata probably
eats part of the gains.

-r


  Lukas
  
  
  CLUBNETIC SUMMER PARTY 2007
  House, club, electro. Najlepsza kompilacja na letnie imprezy!
  http://klik.wp.pl/?adr=http%3A%2F%2Fadv.reklama.wp.pl%2Fas%2Fclubnetic.htmlsid=1266
  
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] There is no NFS over ZFS issue

2007-06-26 Thread Roch - PAE

Regarding the bold statement 


There is no NFS over ZFS issue


What I mean here is that, if you _do_ encounter a
performance pathology not linked to the NVRAM storage/cache-flush
issue, then you _should_ complain, or better, get someone
to do an analysis of the situation.

One should not assume that some observed pathological
performance of NFS/ZFS is widespread and due to some known
ZFS issue about to be fixed.

To be sure, there are lots of performance opportunities that
will provide incremental improvements, the most significant
of which is the ZFS Separate Intent Log just integrated in
Nevada. This opens up the field for further NFS/ZFS
performance investigations.

But the data that got this thread started seems to highlight
an NFS vs Samba opportunity, something we need to look
into. Otherwise I don't think that the data produced so far
has highlighted any specific NFS/ZFS issue. There are
certainly opportunities for incremental performance
improvements but, to the best of my knowledge, outside of the
NVRAM/flush issue on certain storage :


There are no known prevalent NFS over ZFS performance
pathologies on record.


-r


Ref: 
http://mail.opensolaris.org/pipermail/zfs-discuss/2007-June/thread.html#29026


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Slow write speed to ZFS pool (via NFS)

2007-06-25 Thread Roch - PAE


Sorry about that; looks like you've hit this:

6546683 marvell88sx driver misses wakeup for mv_empty_cv
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6546683

Fixed in snv_64.
-r


Thomas Garner writes:
   We have seen this behavior, but it appears to be entirely related to the 
   hardware having the Intel IPMI stuff swallow up the NFS traffic on port 
   623 directly by the network hardware and never getting.
  
   http://blogs.sun.com/shepler/entry/port_623_or_the_mount
  
  Unfortunately, this nfs hangs across 3 separate machines, none of
  which should have this IPMI issue.  It did spur me on to dig a little
  deeper, though, so thanks for the encouragement that all may not be
  well.
  
  Can anyone debug this?  Remember that this is Nexenta Alpha 7, so it
  should be b61.  nfsd is totally hung (rpc timeouts) and zfs would be
  having problems taking snapshots, if I hadn't disabled the hourly
  snapshots.
  
  Thanks!
  Thomas
  
  [EMAIL PROTECTED] ~]$ rpcinfo -t filer0 nfs
  rpcinfo: RPC: Timed out
  program 13 version 0 is not available
  
  echo ::pgrep nfsd | ::walk thread | ::findstack -v | mdb -k
  
  stack pointer for thread 821cda00: 822d6e28
822d6e5c swtch+0x17d()
822d6e8c cv_wait_sig_swap_core+0x13f(8b8a9232, 8b8a9200, 0)
822d6ea4 cv_wait_sig_swap+0x13(8b8a9232, 8b8a9200)
822d6ee0 cv_waituntil_sig+0x100(8b8a9232, 8b8a9200, 0)
822d6f44 poll_common+0x3e1(8069480, a, 0, 0)
822d6f84 pollsys+0x7c()
822d6fac sys_sysenter+0x102()
  stack pointer for thread 821d2e00: 8c279d98
8c279dcc swtch+0x17d()
8c279df4 cv_wait_sig+0x123(8988796e, 89887970)
8c279e2c svc_wait+0xaa(1)
8c279f84 nfssys+0x423()
8c279fac sys_sysenter+0x102()
  stack pointer for thread a9f88800: 8c92e218
8c92e244 swtch+0x17d()
8c92e254 cv_wait+0x4e(8a4169ea, 8a4169e0)
8c92e278 mv_wait_for_dma+0x32()
8c92e2a4 mv_start+0x278(88252c78, 89833498)
8c92e2d4 sata_hba_start+0x79(8987d23c, 8c92e304)
8c92e308 sata_txlt_synchronize_cache+0xb7(8987d23c)
8c92e334 sata_scsi_start+0x1b7(8987d1e4, 8987d1e0)
8c92e368 scsi_transport+0x52(8987d1e0)
8c92e3a4 sd_start_cmds+0x28a(8a2710c0, 0)
8c92e3c0 sd_core_iostart+0x158(18, 8a2710c0, 8da3be70)
8c92e3f8 sd_uscsi_strategy+0xe8(8da3be70)
8c92e414 sd_send_scsi_SYNCHRONIZE_CACHE+0xd4(8a2710c0, 8c50074c)
8c92e4b0 sdioctl+0x48e(1ac0080, 422, 8c50074c, 8010, 883cee68, 0)
8c92e4dc cdev_ioctl+0x2e(1ac0080, 422, 8c50074c, 8010, 883cee68, 0)
8c92e504 ldi_ioctl+0xa4(8a671700, 422, 8c50074c, 8010, 883cee68, 0)
8c92e544 vdev_disk_io_start+0x187(8c500580)
8c92e554 vdev_io_start+0x18(8c500580)
8c92e580 zio_vdev_io_start+0x142(8c500580)
8c92e59c zio_next_stage+0xaa(8c500580)
8c92e5b0 zio_ready+0x136(8c500580)
8c92e5cc zio_next_stage+0xaa(8c500580)
8c92e5ec zio_wait_for_children+0x46(8c500580, 1, 8c50076c)
8c92e600 zio_wait_children_ready+0x18(8c500580)
8c92e614 zio_next_stage_async+0xac(8c500580)
8c92e624 zio_nowait+0xe(8c500580)
8c92e660 zio_ioctl+0x94(9c6f8300, 89557c80, 89556400, 422, 0, 0)
8c92e694 zil_flush_vdev+0x54(89557c80, 0, 0, 8c92e6e0, 9c6f8500)
8c92e6e4 zil_flush_vdevs+0x6b(8bbe46c0)
8c92e734 zil_commit_writer+0x35f(8bbe46c0, 3497c, 0, 4af5, 0)
8c92e774 zil_commit+0x96(8bbe46c0, , , 4af5, 0)
8c92e7e8 zfs_putpage+0x1e4(8c8ab480, 0, 0, 0, 0, 8c6c75c0)
8c92e824 vhead_putpage+0x95(8c8ab480, 0, 0, 0, 0, 8c6c75c0)
8c92e86c fop_putpage+0x27(8c8ab480, 0, 0, 0, 0, 8c6c75c0)
8c92e91c rfs4_op_commit+0x153(82141dd4, b28c3100, 8c92ed8c, 8c92e948)
8c92ea48 rfs4_compound+0x1ce(8c92ead0, 8c92ea7c, 0, 8c92ed8c, 0)
8c92eaac rfs4_dispatch+0x65(8bf9b248, 8c92ed8c, b28c5a40, 8c92ead0)
8c92ed10 common_dispatch+0x6b0(8c92ed8c, b28c5a40, 2, 4, 8bf9c01c, 
  8bf9b1f0)
8c92ed34 rfs_dispatch+0x1f(8c92ed8c, b28c5a40)
8c92edc4 svc_getreq+0x158(b28c5a40, 842952a0)
8c92ee0c svc_run+0x146(898878e8)
8c92ee2c svc_do_run+0x6e(1)
8c92ef84 nfssys+0x3fb()
8c92efac sys_sysenter+0x102()
  snipping out a bunch of other threads

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Slow write speed to ZFS pool (via NFS)

2007-06-21 Thread Roch - PAE

Joe S writes:
  After researching this further, I found that there are some known
  performance issues with NFS + ZFS. I tried transferring files via SMB, and
  got write speeds on average of 25MB/s.
  
  So I will have my UNIX systems use SMB to write files to my Solaris server.
  This seems weird, but its fast. I'm sure Sun is working on fixing this. I
  can't imagine running a Sun box with out NFS.
  

Call me picky, but :

There is no NFS over ZFS issue (IMO/FWIW).
There is a ZFS over NVRAM issue; well understood (not related to NFS).
There is a Samba vs NFS issue; not well understood (not related to ZFS).


This last bullet is probably better suited for
[EMAIL PROTECTED]


If ZFS is talking to storage array with NVRAM, then we have
an issue (not related to NFS) described by  :

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6462690
6462690 sd driver should set SYNC_NV bit when issuing SYNCHRONIZE CACHE 
to SBC-2 devices

The above bug/RFE lies in the SD driver but is very much
triggered by ZFS, particularly when running NFS, but not only. It
affects only NVRAM-based storage and is being worked on.

If ZFS is talking to a JBOD, then the slowness is a
characteristic of NFS (not related to ZFS).

So FWIW on JBOD, there is no ZFS+NFS issue, in the sense
that I don't know how we could change ZFS to be
significantly better at NFS, nor do I know of a change to NFS
that would help _particularly_ ZFS. That doesn't mean there is
none; I just don't know about them. So please ping me if you
highlight such an issue. And if one replaces ZFS with some other
filesystem and gets a large speedup, I'm interested (make sure
the other filesystem either runs with the write cache off, or
flushes it on NFS commit).
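
(For reference, and only as a sketch: on a plain directly-attached
disk the write cache can usually be inspected and toggled from the
expert mode of format(1M); whether the cache menu shows up depends on
the drive and driver:)

    # format -e
      format> cache
      cache> write_cache
      write_cache> display
      write_cache> disable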

So that leaves us with a Samba vs NFS issue (not related to
ZFS). We know that NFS is able to create files _at most_ at a rate of
one file per server I/O latency. Samba appears better, and this is
what we need to investigate. It might be better in a way
that NFS can borrow (maybe through some better NFSv4 delegation
code), or Samba might be better by being careless with data.
If we find such an NFS improvement it will help all backend
filesystems, not just ZFS.

Which is why I say: There is no NFS over ZFS issue.


-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS and Tar/Star Performance

2007-06-12 Thread Roch - PAE

Hi Siegfried, just making sure you had seen this:

http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine 

You have very fast NFS to non-ZFS runs.

That seems only possible if the hosting OS did not sync the
data when NFS required it, or the drive in question had some
fast write cache. If the drive did have some FWC and ZFS
was still slow using it, that would be the cache-flushing
issue mentioned in the blog entry.

But also, maybe there is something to be learned from the
Samba and AFP results...

Takeaways:

ZFS and NFS just work together.

ZFS has an open issue with some storage arrays (the
issue is *not* related to NFS); it's being worked
on. It will need collaboration from storage vendors.

NFS is slower than direct-attached storage. It can be very
much slower on single-threaded loads.

There are many ways to work around the slowness, but most
are just not safe for your data.

-r



Siegfried Nikolaivich writes:
  This is an old topic, discussed many times at length.  However, I  
  still wonder if there are any workarounds to this issue except  
  disabling ZIL, since it makes ZFS over NFS almost unusable (it's a  
  whole magnitude slower).  My understanding is that the ball is in the  
  hands of NFS due to ZFS's design.  The testing results are below.
  
  
  Solaris 10u3 AMD64 server with Mac client over gigabit ethernet.  The  
  filesystem is on a 6 disk raidz1 pool, testing the performance of  
  untarring (with bzip2) the Linux 2.6.21 source code.  The archive is  
  stored locally and extracted remotely.
  
  Locally
  ---
  tar xfvj linux-2.6.21.tar.bz2
  real4m4.094s,user0m44.732s,  sys 0m26.047s
  
  star xfv linux-2.6.21.tar.bz2
  real1m47.502s,   user0m38.573s,  sys 0m22.671s
  
  Over NFS
  
  tar xfvj linux-2.6.21.tar.bz2
  real48m22.685s,  user0m45.703s,  sys 0m59.264s
  
  star xfv linux-2.6.21.tar.bz2
  real49m13.574s,  user0m38.996s,  sys 0m35.215s
  
  star -no-fsync -x -v -f linux-2.6.21.tar.bz2
  real49m32.127s,  user0m38.454s,  sys 0m36.197s
  
  
  The performance seems pretty bad, lets see how other protocols fare.
  
  Over Samba
  --
  tar xfvj linux-2.6.21.tar.bz2
  real4m34.952s,   user0m44.325s,  sys 0m27.404s
  
  star xfv linux-2.6.21.tar.bz2
  real4m2.998s,user0m44.121s,  sys 0m29.214s
  
  star -no-fsync -x -v -f linux-2.6.21.tar.bz2
  real4m13.352s,   user0m44.239s,  sys 0m29.547s
  
  Over AFP
  
  tar xfvj linux-2.6.21.tar.bz2
  real3m58.405s,   user0m43.132s,  sys 0m40.847s
  
  star xfv linux-2.6.21.tar.bz2
  real19m44.212s,  user0m38.535s,  sys 0m38.866s
  
  star -no-fsync -x -v -f linux-2.6.21.tar.bz2
  real3m21.976s,   user0m42.529s,  sys 0m39.529s
  
  
  Samba and AFP are much faster, except the fsync'ed star over AFP.  Is  
  this a ZFS or NFS issue?
  
  Over NFS to non-ZFS drive
  -
  tar xfvj linux-2.6.21.tar.bz2
  real5m0.211s,user0m45.330s,  sys 0m50.118s
  
  star xfv linux-2.6.21.tar.bz2
  real3m26.053s,   user0m43.069s,  sys 0m33.726s
  
  star -no-fsync -x -v -f linux-2.6.21.tar.bz2
  real3m55.522s,   user0m42.749s,  sys 0m35.294s
  
  It looks like ZFS is the culprit here.  The untarring is much faster  
  to a single 80 GB UFS drive than a 6 disk raid-z array over NFS.
  
  
  Cheers,
  Siegfried
  
  
  PS. Getting netatalk to compile on amd64 Solaris required some  
  changes since i386 wasn't being defined anymore, and somehow it  
  thought the architecture was sparc64 for some linking steps.

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS - Use h/w raid or not? Thoughts. Considerations.

2007-05-30 Thread Roch - PAE

Torrey McMahon writes:
  Toby Thain wrote:
  
   On 25-May-07, at 1:22 AM, Torrey McMahon wrote:
  
   Toby Thain wrote:
  
   On 22-May-07, at 11:01 AM, Louwtjie Burger wrote:
  
   On 5/22/07, Pål Baltzersen [EMAIL PROTECTED] wrote:
   What if your HW-RAID-controller dies? in say 2 years or more..
   What will read your disks as a configured RAID? Do you know how to 
   (re)configure the controller or restore the config without 
   destroying your data? Do you know for sure that a spare-part and 
   firmware will be identical, or at least compatible? How good is 
   your service subscription? Maybe only scrapyards and museums will 
   have what you had. =o
  
   Be careful when talking about RAID controllers in general. They are
   not created equal! ...
   Hardware raid controllers have done the job for many years ...
  
   Not quite the same job as ZFS, which offers integrity guarantees 
   that RAID subsystems cannot.
  
   Depend on the guarantees. Some RAID systems have built in block 
   checksumming.
  
  
   Which still isn't the same. Sigh. 
  
  Yep.you get what you pay for. Funny how ZFS is free to purchase 
  isn't it?
  

With RAID-level block checksumming, if the data gets
corrupted on its way _to_ the array, that data is lost.

With ZFS and RAID-Z or mirroring, you will recover the
data.

-r



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: gzip compression throttles system?

2007-05-04 Thread Roch - PAE

Ian Collins writes:
  Roch Bourbonnais wrote:
  
   with recent bits ZFS compression is now handled concurrently with many
   CPUs working on different records.
   So this load will burn more CPUs and acheive it's results
   (compression) faster.
  
  Would changing (selecting a smaller) filesystem record size have any effect?
  

If the problem is that we just have a high kernel load
compressing blocks, then probably not. If anything, small
records might be a tad less efficient (thus needing more CPU).

   So the observed pauses should be consistent with that of a load
   generating high system time.
   The assumption is that compression now goes faster than when is was
   single threaded.
  
   Is this undesirable ? We might seek a way to slow down compression in
   order to limit the system load.
  
  I think you should, otherwise we have a performance throttle that scales
  with the number of cores!
  

Again I wonder to what extent the issue becomes painful due 
to lack of write throttling. Once we have that in, we should 
revisit this. 

-r

  Ian
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ARC, mmap, pagecache...

2007-05-04 Thread Roch - PAE

Manoj Joseph writes:
  Hi,
  
  I was wondering about the ARC and its interaction with the VM 
  pagecache... When a file on a ZFS filesystem is mmaped, does the ARC 
  cache get mapped to the process' virtual memory? Or is there another copy?
  

My understanding is:

The ARC does not get mapped to user space. The data ends up in the ARC
(in recordsize chunks) and in the page cache (in page-sized chunks).
Both copies are updated on writes.

-r

  -Manoj

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] HowTo: UPS + ZFS NFS + no fsync

2007-04-27 Thread Roch - PAE

Robert Milkowski writes:
  Hello Wee,
  
  Thursday, April 26, 2007, 4:21:00 PM, you wrote:
  
  WYT On 4/26/07, cedric briner [EMAIL PROTECTED] wrote:
   okay let'say that it is not. :)
   Imagine that I setup a box:
 - with Solaris
 - with many HDs (directly attached).
 - use ZFS as the FS
 - export the Data with NFS
 - on an UPS.
  
   Then after reading the :
   http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
   I wonder if there is a way to tell the OS to ignore the fsync flush
   commands since they are likely to survive a power outage.
  
  WYT Cedric,
  
  WYT You do not want to ignore syncs from ZFS if your harddisk is directly
  WYT attached to the server.  As the document mentioned, that is really for
  WYT Complex Storage with NVRAM where flush is not necessary.
  
  
  What??
  
  Setting zil_disable=1 has nothing to do with NVRAM in storage arrays.
  It disables ZIL in ZFS wich means that if application calls fsync() or
  opens a file with O_DSYNC, etc. then ZFS won't honor it (return
  immediatelly without commiting to stable storage).
  
  Once txg group closes data will be written to disks and SCSI write
  cache flush commands will be send.
  
  Setting zil_disable to 1 is not that bad actually, and if someone
  doesn't care to lose some last N seconds of data in case of server
  crash (however zfs itself will be consistent) it can actually speed up
  nfs operations a lot.
  

...set zil_disable...speed up nfs...at the expense of a risk
of corruption of the NFS client's view. We must never forget
this.

zil_disable is really not an option IMO.


-r

  
  -- 
  Best regards,
   Robertmailto:[EMAIL PROTECTED]
 http://milek.blogspot.com
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] HowTo: UPS + ZFS NFS + no fsync

2007-04-27 Thread Roch - PAE

Wee Yeh Tan writes:
  Robert,
  
  On 4/27/07, Robert Milkowski [EMAIL PROTECTED] wrote:
   Hello Wee,
  
   Thursday, April 26, 2007, 4:21:00 PM, you wrote:
  
   WYT On 4/26/07, cedric briner [EMAIL PROTECTED] wrote:
okay let'say that it is not. :)
Imagine that I setup a box:
  - with Solaris
  - with many HDs (directly attached).
  - use ZFS as the FS
  - export the Data with NFS
  - on an UPS.
   
Then after reading the :
http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
I wonder if there is a way to tell the OS to ignore the fsync flush
commands since they are likely to survive a power outage.
  
   WYT Cedric,
  
   WYT You do not want to ignore syncs from ZFS if your harddisk is directly
   WYT attached to the server.  As the document mentioned, that is really for
   WYT Complex Storage with NVRAM where flush is not necessary.
  
  
   What??
  
   Setting zil_disable=1 has nothing to do with NVRAM in storage arrays.
   It disables ZIL in ZFS wich means that if application calls fsync() or
   opens a file with O_DSYNC, etc. then ZFS won't honor it (return
   immediatelly without commiting to stable storage).
  
  Wait a minute.  Are we talking about zil_disable or zfs_noflush (or
  zfs_nocacheflush)?
  The article quoted was about configuring the array to ignore flush
  commands or device specific zfs_noflush, not zil_disable.
  
  I agree that zil_disable is okay from FS view (correctness still
  depends on the application), but zfs_noflush is dangerous.
  

For me, both are dangerous.

zil_disable can cause immense pain

  to applications and NFS clients. I don't see how anyone
  can recommend it without mentioning the risk of
  application/NFS corruption.

zfs_nocacheflush is also unsafe.

  It opens a risk of pool corruption ! But if you have
  *all* of your pooled data on safe NVRAM-protected storage,
  and you can't find a way to tell the storage to
  ignore cache-flush requests, you might want to set the
  variable temporarily until the SYNC_NV thing is sorted
  out. Then make sure nobody imports the tunable elsewhere
  without full understanding, and make sure no one creates a
  new pool with non-NVRAM storage. Since those things are
  not under anyone's control, it's not a good idea to spread
  this kind of recommendation.
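
(If in doubt about what a given box is currently running with, both
variables can be inspected live -- a sketch, assuming the symbols are
present in the zfs module on that build:)

    # echo zil_disable/D | mdb -k
    # echo zfs_nocacheflush/D | mdb -k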

  
  -- 
  Just me,
  Wire ...
  Blog: prstat.blogspot.com

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] cow performance penatly

2007-04-27 Thread Roch - PAE

Chad Mynhier writes:
  On 4/27/07, Erblichs [EMAIL PROTECTED] wrote:
   Ming Zhang wrote:
   
Hi All
   
I wonder if any one have idea about the performance loss caused by COW
in ZFS? If you have to read old data out before write it to some other
place, it involve disk seek.
   
  
   Ming,
  
   Lets take a pro example with a minimal performance
   tradeoff.
  
   All FSs that modify a disk block, IMO, do a full
   disk block read before anything.
  
  
  Actually, I'd say that this is the main point that needs to be made.
  If you're modifying data that was once on disk, that data had to be
  read from at some point in the past.  This is invariably true for any
  filesystem.
  

Nits, just so readers are clear about this : the read of old
data to service a write only needs to be done when handling a write
of a partial filesystem block (and the data is not already cached, as
mentioned). For a fixed-block-size database with a matching ZFS
recordsize, writes will mostly be handled without a need to
read the previous data. Most FSes should behave the same here.
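
(For the fixed-record database case above, matching the sizes is a
one-liner -- hypothetical dataset name, and it must be set before the
data files are created:)

    # zfs set recordsize=8K tank/db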

  With traditional filesystems, that data block is rewritten in the same
  place.  If it were the case that disk blocks were always written
  immediately after being read, with no intervening I/O to cause a disk
  seek, COW would have no performance benefit over traditional
  filesystems.  (Well, this isn't true, as there are other benefits to
  be had.)
  
  But it's rarely (if ever) the case that this happens.  The modified
  block is generally written some time after the original block was
  read, with plenty of intervening I/O that leaves the disk head over
  some random location on the platter.  So for traditional filesystems,
  the in-place write of a modified block will typically involve a disk
  seek.
  
  And a second point to be made about this is the effect of caching.
  With any filesystem, writes are cached in memory and flushed out to
  disk on a regular basis.  With traditional filesystems, flushing the
  cache involves a set of random writes on the disk, which is possibly
  going to involve a disk seek for every block written.  (In the best
  case, writes could be reordered in ascending order across the disk to
  minimize the disk seeks, but there would still possibly be a small
  disk seek between each write.)
  
  With a COW filesystem, flushing the cache involves writing
  sequentially to disk with no intervening disk seeks.  (This assumes
  that there's enough free space on disk to avoid fragmentation.)  In
  the ideal case, this means writing to disk at full platter speed.
  This is where the main performance benefit of COW comes from.
  

yep.

  Chad Mynhier

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HowTo: UPS + ZFS NFS + no fsync

2007-04-27 Thread Roch - PAE

cedric briner writes:
   You might set zil_disable to 1 (_then_ mount the fs to be
   shared). But you're still exposed to OS crashes; those would
   still corrupt your nfs clients.
  Just to better understand ? (I know that I'm quite slow :( )
  when you say _nfs clients_ are you specifically talking of:
  - the nfs client program itself :
  (lockd, statd) meaning that you can have a stale nfs handle or other 
  things ?
  - the host acting as an nfs client
  meaning that the nfs client service works, but the data that the 
  software uses on the NFS-mounted disk could be corrupted.
  

It's rather the applications running on the client.
Basically, we would have data loss from the perspective of
applications running on the client, without any sign of errors. It's a bit like
having a disk that drops a write request and does not
signal an error.

  
  If I'm digging and digging into this ZIL and NFS/UFS with write 
  cache business, that's because I do not understand what kind of problems 
  can occur. What I read in general are statements like _corruption_ from the 
  client's point of view... but what does that mean?
  
  is this the scheme of what can happen:
  - the application on the nfs client side writes data to the nfs server
  - meanwhile the nfs server crashes, so:
- the data are not stored
- the application on the nfs client thinks that the data are stored ! :(
  - when the server is up again
  - the nfs client sees the data again
  - the application on the nfs client side finds itself with data in the 
  previous state of its last writes.
  
  Am I right ?

The scenario I see would be on the client, 
download some software (a tar file).

tar x
make

The tar succeeded with no errors at all. Behind our back,
during the tar x, the server rebooted. No big deal
normally. But with zil_disable on the server, the make
fails, either because some files from the original tar are
missing, or parts of files are.

  
  So with ZIL:
- The application has the ability to do things in the right way. So 
  even after an nfs-server crash, the application on the nfs-client side can 
  rely on its own data.
  
  So without ZIL:
- The application does not have the ability to do things in the right way. 
  And we can have a corruption of data. But that doesn't mean corruption 
  of the FS. It means that the data were partially written and some are 
  missing.

Sounds right.

  
   For the love of God do NOT do stuff like that.
   
   Just create ZFS on a pile of disks the way that we should, with the
   write cache disabled on all the disks and with redundancy in the ZPool
   config .. nothing special :

  Whoa !! no..  this is really special to me !!
  I've read and re-read many times the:
- NFS and ZFS, a fine combination
- ZFS Best Practices Guide
  and other blogs without noticing such an idea !
  
  I even noticed the opposite recommendation
  from:
  -ZFS Best Practices Guide  ZFS Storage Pools Recommendations
  -http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_Storage_Pools_Recommendations
  where I read :
- For production systems, consider using whole disks for storage pools 
  rather than slices for the following reasons:
 + Allow ZFS to enable the disk's write cache, for those disks that 
  have write caches
 
  and from:
  -NFS and ZFS, a fine combination  Comparison with UFS
  -http://blogs.sun.com/roch/#zfs_to_ufs_performance_comparison
  where I read :
Semantically correct NFS service :
  
   nfs/ufs : 17 sec (write cache disable)
   nfs/zfs : 12 sec (write cache disable,zil_disable=0)
   nfs/zfs :  7 sec (write cache enable,zil_disable=0)
  then I can say:
that nfs/zfs with write cache enabled and zil enabled is --in that 
  case-- faster
  
  So why are you recommending that I disable the write cache ?
 

For ZFS, it can work either way. Maybe the above was a typo.

  -- 
  
  Cedric BRINER
  Geneva - Switzerland
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] HowTo: UPS + ZFS NFS + no fsync

2007-04-26 Thread Roch - PAE

You might set zil_disable to 1 (_then_ mount the fs to be
shared). But you're still exposed to OS crashes; those would 
still corrupt your nfs clients.
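
In practice, on a live system of that era, "set it, then mount" looked
roughly like this (dataset name hypothetical, and the corruption caveat
above still applies):

    echo zil_disable/W0t1 | mdb -kw     # flip the tunable on the running kernel
    zfs unmount tank/export
    zfs mount tank/export               # remount so the shared fs picks it up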

-r


cedric briner writes:
  Hello,
  
  I wonder if the subject of this email is not self-explanatory ?
  
  
  okay, let's say that it is not. :)
  Imagine that I setup a box:
- with Solaris
- with many HDs (directly attached).
- use ZFS as the FS
- export the Data with NFS
- on an UPS.
  
  Then after reading the : 
  http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#ZFS_and_Complex_Storage_Considerations
  I wonder if there is a way to tell the OS to ignore the fsync flush 
  commands since they are likely to survive a power outage.
  
  
  Ced.
  
  -- 
  
  Cedric BRINER
  Geneva - Switzerland

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: storage type for ZFS

2007-04-18 Thread Roch - PAE
Richard L. Hamilton writes:
  Well, no; his quote did say software or hardware.  The theory is apparently
  that ZFS can do better at detecting (and with redundancy, correcting) errors
  if it's dealing with raw hardware, or as nearly so as possible.  Most SANs
  _can_ hand out raw LUNs as well as RAID LUNs, the folks that run them are
  just not used to doing it.
  
  Another issue that may come up with SANs and/or hardware RAID:
  supposedly, storage systems with large non-volatile caches will tend to have
  poor performance with ZFS, because ZFS issues cache flush commands as
  part of committing every transaction group; this is worse if the filesystem
  is also being used for NFS service.  Most such hardware can be
  configured to ignore cache flushing commands, which is safe as long as
  the cache is non-volatile.
  
  The above is simply my understanding of what I've read; I could be way off
  base, of course.
   

Sounds good to me. The first point is easy to understand. If
you rely on ZFS for data reconstruction, carve virtual
luns out of your storage, and mirror those luns in ZFS, then it's
possible that both copies of a mirrored block end up on a
single physical device.

Performance-wise, the ZFS I/O scheduler might interact in
interesting ways with the one in the storage, but I don't
know if this has been studied in depth.

-r


   
  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] zfs block allocation strategy

2007-04-18 Thread Roch - PAE

tester writes:
  Hi,
  
  quoting from zfs docs
  
  The SPA allocates blocks in a round-robin fashion from the top-level
  vdevs. A storage pool with multiple top-level vdevs allows the SPA to
  use dynamic striping to increase disk bandwidth. Since a new block may
  be allocated from any of the top-level vdevs, the SPA implements
  dynamic striping by spreading out writes across all available
  top-level vdevs 
  
  
  Now, if I need two filesystems, /protect (mirrored - 2 physical) and
  /fast_unprot (striped - 3 physical), is it correct that we end up with
  2 top-level vdevs? If that is the case, then from the above paragraph,
  does it mean that blocks for either filesystem can end up in
  any of the 5 physicals? What happens to the intended protection and
  performance? I am sure I am missing some basics here.  
  

you probably end up with 2 pools, 1 mirrored vdev, and 1
stripe of 3 vdevs.
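
In other words, something along these lines (device names purely
illustrative):

    zpool create protect mirror c2t0d0 c2t1d0          # mirrored pair for /protect
    zpool create fast_unprot c2t2d0 c2t3d0 c2t4d0      # 3-disk dynamic stripe for /fast_unprot

With two separate pools, blocks from one filesystem can never land on the
other pool's disks, which is what preserves the intended protection and
performance characteristics.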

-r

  Thanks for the clarification
   
   
  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] query on ZFS

2007-04-11 Thread Roch - PAE

Annie Li writes:
  Can anyone help explain what out-of-order issue means in the
  following segment?
  
  ZFS has a pipelined I/O engine, similar in concept to CPU pipelines. The
  pipeline operates on I/O dependency graphs and provides scoreboarding,
  priority, deadline scheduling, out-of-order issue and I/O aggregation.
  I/O loads that bring other filesystems to their knees
  http://blogs.sun.com/roller/page/bill?entry=zfs_vs_the_benchmark are
  handled with ease by the ZFS I/O pipeline.
  
  
  Thanks,
  Annie

As an example, it says that, even if a read was issued by an
application 'after' ZFS had started to work on a group of
write I/Os, the read could actually issue ahead of some of
the writes.


-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS improvements

2007-04-11 Thread Roch - PAE

Gino writes:
   6322646 ZFS should gracefully handle all devices
   failing (when writing)
   
   Which is being worked on.  Using a redundant
   configuration prevents this
   from happening.
  
  What do you mean by redundant?  All our servers have 2 or 4 HBAs, 2 or 4 
  fc switches and storage arrays with redundant controllers.
  We used only RAID10 zpools but we still had them corrupted.
   

Redundant from the viewpoint of ZFS. So either zfs mirror or 
raid-z. The point of the bug is to better handle failures on 
devices in non-redundant pools. For redundant pools, ZFS is able to
self-heal problems as they arise. If you maintain redundancy 
only at the storage level, then it's harder for ZFS to deal with
problems. We should still behave better than we do now, hence 6322646.

Can you post your zpool status output ?

-r

   
  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] C'mon ARC, stay small...

2007-04-02 Thread Roch - PAE
Jason J. W. Williams writes:
  Hi Guys,
  
  Rather than starting a new thread I thought I'd continue this thread.
  I've been running Build 54 on a Thumper since Mid January and wanted
  to ask a question about the zfs_arc_max setting. We set it to 
  0x1 #4GB, however it's creeping over that until our kernel
  memory usage is nearly 7GB (::memstat inserted below).
  
  This is a database server, so I was curious if the DNLC would have this
  effect over time, as it does quite quickly when dealing with small
  files? Would it be worth upgrading to Build 59?
  

Another possibility is that there is a portion of memory
that might be in the kmem caches, ready to be reclaimed and
returned to the OS free space. Such reclaims currently only
occur on memory shortage. I think we should do it under
some more conditions...

This might fall under:

  CrNumber: 6416757
  Synopsis: zfs should return memory eventually
  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6416757

If you induce some temporary memory pressure, it would be
nice to see if your kernel shrinks down to ~4GB.

-r
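
For reference, the /etc/system tuning mentioned in the quoted exchange
below would be a single line of roughly this form on a Nevada build of
that vintage (0x100000000 is simply 4 GB written out in hex):

    set zfs:zfs_arc_max = 0x100000000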


  Thank you in advance!
  
  Best Regards,
  Jason
  
  Page Summary            Pages        MB   %Tot
  ----------------   ----------   -------   ----
  Kernel                 1750044      6836    42%
  Anon                   1211203      4731    29%
  Exec and libs             7648        29     0%
  Page cache              220434       861     5%
  Free (cachelist)        318625      1244     8%
  Free (freelist)         659607      2576    16%
  
  Total                  4167561     16279
  Physical               4078747     15932
  
  
  On 3/23/07, Roch - PAE [EMAIL PROTECTED] wrote:
  
   With latest Nevada setting zfs_arc_max in /etc/system is
   sufficient. Playing with mdb on a live system is more
   tricky and is what caused the problem here.
  
   -r
  
   [EMAIL PROTECTED] writes:
 Jim Mauro wrote:

  All righty...I set c_max to 512MB, c to 512MB, and p to 256MB...
 
arc::print -tad
  {
   ...
  c02e29e8 uint64_t size = 0t299008
  c02e29f0 uint64_t p = 0t16588228608
  c02e29f8 uint64_t c = 0t33176457216
  c02e2a00 uint64_t c_min = 0t1070318720
  c02e2a08 uint64_t c_max = 0t33176457216
  ...
  }
c02e2a08 /Z 0x2000
  arc+0x48:   0x7b9789000 =   0x2000
c02e29f8 /Z 0x2000
  arc+0x38:   0x7b9789000 =   0x2000
c02e29f0 /Z 0x1000
  arc+0x30:   0x3dcbc4800 =   0x1000
arc::print -tad
  {
  ...
  c02e29e8 uint64_t size = 0t299008
  c02e29f0 uint64_t p = 0t268435456  -- p
  is 256MB
  c02e29f8 uint64_t c = 0t536870912  -- c
  is 512MB
  c02e2a00 uint64_t c_min = 0t1070318720
  c02e2a08 uint64_t c_max = 0t536870912--- c_max is
  512MB
  ...
  }
 
  After a few runs of the workload ...
 
arc::print -d size
  size = 0t536788992
   
 
 
  Ah - looks like we're out of the woods. The ARC remains clamped at 
   512MB.


 Is there a way to set these fields using /etc/system?
 Or does this require a new or modified init script to
 run and do the above with each boot?

 Darren


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] 6410 expansion shelf

2007-03-29 Thread Roch - PAE
Robert Milkowski writes:
  Hello Selim,
  
  Wednesday, March 28, 2007, 5:45:42 AM, you wrote:
  
  SD talking of which,
  SD what's the effort and consequences to increase the max allowed block
  SD size in zfs to highr figures like 1M...
  
  Think what would happen then if you try to read 100KB of data - due to
  checksumming ZFS would have to read the entire MB.
  
  However it should be possible to batch several IOs together and issue
  one larger with ZFS - at least I hope it's possible.
  

As you note, the max coherency unit (blocksize) in ZFS is
128K. It's also the max I/O size, and smaller I/Os are
already aggregated or batched up to that size.

At a 128K size the control-to-data ratio on the wire is already
quite reasonable. So I don't see much benefit in increasing this
(there may be some, but the context needs to be well defined).

The issue is subject to debate because traditionally one I/O
came with an implied overhead of a full head seek. In that
case, the larger the I/O the better. So at 60MB/s throughput
and 5ms head seek time, we need I/Os of 300K to make the data
transfer time larger than the seek time, and ~3MB I/O
sizes to reach the point of diminishing return.
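
(Spelling out the arithmetic: 60 MB/s x 5 ms = 300 KB, the I/O size at
which transfer time equals seek time; at ~3 MB the transfer takes ~50 ms
against the 5 ms seek, so the seek is down to roughly 10% overhead --
the point of diminishing returns.)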

But with a write-allocate scheme we are not hit with a
head seek for every I/O, and the common I/O-size wisdom needs to
be reconsidered.

-r


  
  -- 
  Best regards,
   Robertmailto:[EMAIL PROTECTED]
 http://milek.blogspot.com
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] ZFS and Kstats

2007-03-27 Thread Roch - PAE

See

Kernel Statistics Library Functions kstat(3KSTAT)

-r

Atul Vidwansa writes:
  Peter,
  How do I get those stats programatically? Any clues?
  Regards,
  _Atul
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] missing features?Could/should zfs support a new ioctl, constrained if neede

2007-03-26 Thread Roch - PAE

Richard L. Hamilton writes:
  _FIOSATIME - why doesn't zfs support this (assuming I didn't just miss it)?
  Might be handy for backups.
  

Are these syscalls sufficient ?

 int utimes(const char *path, const struct timeval times[2]);
 int futimesat(int fildes, const char *path, const struct timeval times[2]);

  Could/should zfs support a new ioctl, constrained if needed to files of
  zero size, that sets an explicit (and fixed) blocksize for a particular
  file?  That might be useful for performance in special cases when one
  didn't necessarily want to specify (or depend on the specification of
  perhaps) the attribute at the filesystem level.  One could imagine a
  database that was itself tunable per-file to a similar range of
  blocksizes, which would almost certainly benefit if it used those sizes
  for the corresponding files.  Additional capabilities that might be
  desirable: setting the blocksize to zero to let the system return to
  default behavior for a file; being able to discover the file's blocksize
  (does fstat() report this?) as well as whether it was fixed at the
  filesystem level, at the file level, or in default state.
  

Yep, it does look interesting. 


  Wasn't there some work going on to add real per-user (and maybe per-group)
  quotas, so one doesn't necessarily need to be sharing or automounting
  thousands of individual filesystems (slow)?  Haven't heard anything lately 
  though...
   

What is slow here is mounting all those FS at boot and
unmounting at shutdown. The most relevant project here in my 
mind is :

6478980 zfs should support automount property

which would give ZFS a mount on demand behavior. Fast boot/shutdown
and fewer mounted FS at any one time.

Then we need to make administering many user FSes as painless as
administering a single one.

-r

   
  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS performance with Oracle

2007-03-21 Thread Roch - PAE

JS writes:
  The big problem is that if you don't do your redundancy in the zpool,
  then the loss of a single device flatlines the system. This occurs in
  single device pools or stripes or concats. Sun support has said in
  support calls and Sunsolve docs that this is by design, but I've never
  seen the loss of any other filesystem cause a machine to halt and dump
  core. Multiple bus resets can create a condition that makes the kernel
  believe that the device is no longer available. This was a persistent
  problem, especially on Pillar, until I started using setting
  sd_max_throttle down. 

Such failures are certainly not by design and my
understanding is that it's being very actively worked on.

This said, redundancy in the zpool is a great idea.
At the least it protects the path between the filesystem and 
the storage.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Re: ZFS memory and swap usage

2007-03-20 Thread Roch - PAE

Hi Mike, This already integrated in Nevada:

6510807 ARC statistics should be exported via kstat

kstat zfs:0:arcstats


module: zfs                             instance: 0
name:   arcstats                        class:    misc
        c                               534457344
        c_max                           16028893184
        c_min                           534457344
        crtime                          6301.4284957
        deleted                         1149800
        demand_data_hits                4514722
        demand_data_misses              54810
        demand_metadata_hits            289342
        demand_metadata_misses          5203
        evict_skip                      0
        hash_chain_max                  8
        hash_chains                     8192
        hash_collisions                 1243605
        hash_elements                   53250
        hash_elements_max               250443
        hits                            9929297
        mfu_ghost_hits                  3917
        mfu_hits                        2496914
        misses                          60013
        mru_ghost_hits                  29072
        mru_hits                        2596064
        mutex_miss                      4791
        p                               210483584
        prefetch_data_hits              5125227
        prefetch_data_misses            0
        prefetch_metadata_hits          6
        prefetch_metadata_misses        0
        recycle_miss                    2338
        size                            439890944
        snaptime                        939404.5920782
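
If you only want a couple of these counters, or a quick hit ratio, something
like the following works (a sketch; the hit-ratio awk one-liner is an
illustration, not part of the original output):

    kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses
    kstat -p zfs:0:arcstats:hits zfs:0:arcstats:misses | awk '/:hits/{h=$2} /:misses/{m=$2} END{printf("ARC hit ratio: %.1f%%\n", 100*h/(h+m))}'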

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: ZFS memory and swap usage

2007-03-19 Thread Roch - PAE

Info on tuning the ARC was just recently updated:

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide#Memory_and_Dynamic_Reconfiguration_Recommendations

-r

Rainer Heilke writes:
  Thanks for the feedback. Please see below.
  
   ZFS should give back memory used for cache to the system
   if applications are demanding it. Right, it should, but sometimes it
   won't.
   
   However with databases there's simple workaround - as
   you know how much ram all databases will consume at least you can
   limit ZFS's arc cache to remaining free memory (and possibly reduce
    it even more by a 2-3x factor). For details on how to do it see 'C'mon
   ARC, stay small...' thread here.
   
   So if you have 16GB RAM in a system and want 10GB for
   SGA + another 2GB for Oracle + 1GB for other kernel resources you
   are with 3GB left.
   
   So I would limit arc c_max to 3GB or even to 1GB.
  
  I was of the understanding that this kernel setting was only introduced in 
  newer Nevada builds. Does this actually work under Solaris 10, Update 3?
  
  Thanks again.
  Rainer
   
   
  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Re: ZFS memory and swap usage

2007-03-19 Thread Roch - PAE

Rainer Heilke writes:
  The updated information states that the kernel setting is only for the
  current Nevada build. We are not going to use the kernel debugger
  method to change the setting on a live production system (and do this
  every time we need to reboot). 
  
  We're back to trying to set their expectations more realistically, and
  using proper tools to measure memory usage. As I stated at the outset,
  they are trying to start up a 10GB SGA database within two minutes to
  simulate the start-up of five 2GB databases at boot-up. I sincerely
  doubt they are going to start all five databases simultaneously within
  two minutes on a regular boot-up. 
  

After bootup, ZFS should have near zero memory in the
ARC. Limiting the ARC should have no effect on their startup 
times. Right ?

-r


  So, what is the best use of the OS tools (vmstat, etc.) to show them
  how this would really occur? 
  
  Rainer
   
   
  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS stalling problem

2007-03-12 Thread Roch - PAE


Working with a small txg_time means we are hit by the pool
sync overhead more often. This is why the per second
throughput has smaller peak values.

With txg_time = 5, we have another problem which is that
depending on timing of the pool sync, some txg can end up 
with too little data in them and sync quickly. We're closing 
in (I hope) on fixing both issues:


6429205 each zpool needs to monitor its throughput and throttle heavy 
writers 
6415647 Sequential writing is jumping

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=

-r


Jesse DeFer writes:
  OK, I tried it with txg_time set to 1 and am seeing less predictable
  results.  The first time I ran the test it completed in 27 seconds (vs
  24s for ufs or 42s with txg_time=5).  Further tests ran from 27s to
  43s, about half the time greater than 40s. 
  
  zpool iostat doesn't show the large no-writes gaps, but it is still very 
  bursty and peak bandwidth is lower.  Here is a 29s run:
  
  tank 113K   464G  0  0  0  0
  tank 113K   464G  0226  0  28.2M
  tank40.1M   464G  0441  0  46.9M
  tank88.2M   464G  0384  0  39.8M
  tank 136M   464G  0445  0  47.4M
  tank 184M   464G  0412  0  43.4M
  tank 232M   464G  0411  0  43.2M
  tank 272M   464G  0402  0  42.1M
  tank 320M   464G  0435  0  46.3M
  tank 368M   464G  0366  63.4K  37.7M
  tank 408M   464G  0494  0  53.6M
  tank 456M   464G  0360  0  36.8M
  tank 496M   464G  0420  0  44.5M
  tank 544M   463G  0439  0  46.8M
  tank 585M   463G  0370  0  38.2M
  tank 633M   463G  0407  0  42.6M
  tank 673M   463G  0457  0  49.0M
  tank 713M   463G  0368  0  37.9M
  tank 761M   463G  0443  0  47.2M
  tank 801M   463G  0380  63.4K  39.4M
  tank 844M   463G  0444  63.4K  47.4M
  tank 879M   463G  0184  0  14.9M
  tank 879M   463G  0339  0  33.4M
  tank 913M   463G  0215  0  26.5M
  tank 944M   463G  0393  63.4K  36.4M
  tank 976M   463G  0171  63.4K  10.5M
  tank1008M   463G  0237  63.4K  21.6M
  tank1008M   463G  0312  0  31.5M
  tank1.02G   463G  0137  0  9.05M
  tank1.05G   463G  0313  0  23.4M
  tank1.05G   463G  0  0  0  0
  
  
  Jesse
  
   Jesse,
   
   This isn't a stall -- it's just the natural rhythm of
   pushing out
   transaction groups.  ZFS collects work (transactions)
   until either
   the transaction group is full (measured in terms of
   how much memory
   the system has), or five seconds elapse -- whichever
   comes first.
   
   Your data would seem to suggest that the read side
   isn't delivering
   data as fast as ZFS can write it.  However, it's
   possible that
   there's some sort of 'breathing' effect that's
   hurting performance.
   One simple experiment you could try: patch txg_time
   to 1.  That
   will cause ZFS to push transaction groups every
   second instead of
   the default of every 5 seconds.  If this helps (or if
   it doesn't),
   please let us know.
   
   Thanks,
   
   Jeff
   
   Jesse DeFer wrote:
Hello,

I am having problems with ZFS stalling when
   writing, any help in troubleshooting would be
   appreciated.  Every 5 seconds or so the write
   bandwidth drops to zero, then picks up a few seconds
   later (see the zpool iostat at the bottom of this
   message).  I am running SXDE, snv_55b.

My test consists of copying a 1gb file (with cp)
   between two drives, one 80GB PATA, one 500GB SATA.
   The first drive is the system drive (UFS), the
   second is for data.  I have configured the data
   drive with UFS and it does not exhibit the stalling
   problem and it runs in almost half the time.  I have
   tried many different ZFS settings as well:
   atime=off, compression=off, checksums=off,
   zil_disable=1 all to no effect.  CPU jumps to about
   25% system time during the stalls, and hovers around
5% when data is being transferred.

# zpool iostat 1
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank 183M   464G  0 17  1.12K  1.93M
tank 183M   464G  0457  0  57.2M
tank 183M   464G  0445  0  55.7M
tank 183M   464G  0405  0  50.7M
tank 366M   464G  0226  0  4.97M
tank 366M   464G  0  0  0  0
tank 366M   464G  0  0  0  0
tank  

Re: Re[2]: [zfs-discuss] writes lost with zfs !

2007-03-12 Thread Roch - PAE

Did you run touch from a client ?

ZFS and UFS are different in general, but in response to a local touch
command neither needs to generate immediate I/O, while in response to a client
touch both do. 

-r

Ayaz Anjum writes:
  HI !
  
  Well, as per my actual post, I created a ZFS filesystem as part of Sun Cluster 
  HAStoragePlus and then disconnected the FC cable. Since there was no active 
  IO, the failure of the disk was not detected. Then I touched a file in 
  the zfs filesystem and it went fine; only after that, when I did sync, did 
  the node panic, and the zfs filesystem failed over to the other node. On the 
  other node the file I touched is not there in the same zfs filesystem, 
  hence I am saying that data is lost. I am planning to deploy zfs in a 
  production NFS environment with above 2TB of data where users are 
  constantly updating files. Hence my concerns about data integrity. Please 
  explain.
  
  thaks
  
  Ayaz Anjum
  
  
  
  
  Darren Dunham [EMAIL PROTECTED] 
  Sent by: [EMAIL PROTECTED]
  03/12/2007 05:45 AM
  
  To
  zfs-discuss@opensolaris.org
  cc
  
  Subject
  Re: Re[2]: [zfs-discuss] writes lost with zfs !
  
  
  
  
  
  
   I have some concerns here,  from my experience in the past, touching a 
   file ( doing some IO ) will cause the ufs filesystem to failover, unlike 
  
   zfs where it did not ! Why the behaviour of zfs different than ufs ?
  
  UFS always does synchronous metadata updates.  So a 'touch' that creates
  a file is going to require a metadata write.
  
  ZFS writes may not necessarily hit the disk until a transaction group
  flush. 
  
   is not this compromising data integrity ?
  
  It should not.  Is there a scenario that you are worried about?
  
  -- 
  Darren Dunham   [EMAIL PROTECTED]
  Senior Technical Consultant TAOShttp://www.taos.com/
  Got some Dr Pepper?   San Francisco, CA bay area
This line left intentionally blank to confuse you. 
  ___
  zfs-discuss mailing list
  zfs-discuss@opensolaris.org
  http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
  
  
  
  
  
  
  
  
  
  --
   
  
  Confidentiality Notice : This e-mail  and  any attachments  are 
  confidential  to  the addressee and may also be privileged.  If  you are 
  not  the addressee of  this e-mail, you may not copy, forward, disclose or 
  otherwise use it in any way whatsoever.  If you have received this e-mail 
  by mistake,  please  e-mail  the sender by replying to this message, and 
  delete the original and any print out thereof. 
  

Re: [zfs-discuss] Re: ZFS/UFS layout for 4 disk servers

2007-03-12 Thread Roch - PAE

Frank Cusack writes:
  On March 7, 2007 8:50:53 AM -0800 Matt B [EMAIL PROTECTED] wrote:
   Any thoughts on the best practice points I am raising? It disturbs me
   that it would make a statement like don't use slices for production.
  
  I think that's just a performance thing.
  

Right, I think what would be very suboptimal from a ZFS
standpoint would be to configure 2 slices from _one_ disk
into a given zpool. This would send the I/O scheduler on a
tangent, but it would nevertheless still work.
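
That is, the kind of layout to avoid would be (device and slice names purely
hypothetical):

    zpool create tank c0t0d0s0 c0t0d0s3    # two slices of the same physical disk
                                           # in one pool: works, but both "devices"
                                           # compete for a single disk head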


  -frank

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS/UFS layout for 4 disk servers

2007-03-08 Thread Roch - PAE
Manoj Joseph writes:
  Matt B wrote:
   Any thoughts on the best practice points I am raising? It disturbs me
   that it would make a statement like don't use slices for
   production.
  
  ZFS turns on write cache on the disk if you give it the entire disk to 
  manage. It is good for performance. So, you should use whole disks when 
  ever possible.
  

Just a small clarification to state that the extra
performance  that comes from having the write cache on
applies mostly to disks that do not have other means of
command concurrency (NCQ, CTQ). With NCQ/CTQ, the write
cache setting should not matter much to ZFS performance.
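
If you want to check what a given disk's write cache is actually set to, the
expert mode of format(1M) exposes it -- a sketch, since the menu layout can
vary with the drive type:

    # format -e            (select the disk, then:)
    format> cache
    cache> write_cache
    write_cache> display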

-r

  Slices work too, but write cache for the disk will not be turned on by zfs.
  
  Cheers
  Manoj

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS stalling problem

2007-03-06 Thread Roch - PAE

Jesse, You can change txg_time with mdb

echo txg_time/W0t1 | mdb -kw
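
To check the current value before changing it, a read works the same way:

echo txg_time/D | mdb -k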


-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why number of NFS threads jumps to the max value?

2007-03-05 Thread Roch - PAE

Leon Koll writes:
  On 2/28/07, Roch - PAE [EMAIL PROTECTED] wrote:
  
  
   http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6467988
  
   NFSD  threads are created  on a  demand  spike (all of  them
   waiting on I/O) but then tend to stick around servicing
   moderate loads.
  
   -r
  
  Hello Roch,
  It's not my case. NFS stops to service after some point. And the
  reason is in ZFS. It never happens with NFS/UFS.
  Shortly, my scenario:
  1st SFS run, 2000 requested IOPS. NFS is fine, low number of threads.
  2nd SFS run, 4000 requested IOPS. NFS cannot serve all requests, no of
  threads jumps to max
  3rd SFS run, 2000 requested IOPS. NFS cannot serve all requests, no of
  threads jumps to max.
  System cannot get back to the same results under equal load (1st and 3rd).
  Reboot between 2nd and 3rd doesn't help. The only persistent thing is
  a directory structure that was created during the 2nd run (in SFS
  higher requested load - more directories/files created).
  I am sure it's a bug. I need help. I don't care that ZFS works N times
  worse than UFS. I really care that after heavy load everything is
  totally screwed.
  
  Thanks,
  -- Leon

Hi Leon,

How much is the slowdown between 1st and 3rd ? How filled is 
the pool at each stage ? What does 'NFS stops to service'
mean ?

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why number of NFS threads jumps to the max value?

2007-03-05 Thread Roch - PAE

Leon Koll writes:

  On 3/5/07, Roch - PAE [EMAIL PROTECTED] wrote:
  
   Leon Koll writes:
 On 2/28/07, Roch - PAE [EMAIL PROTECTED] wrote:
 
 
  http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6467988
 
  NFSD  threads are created  on a  demand  spike (all of  them
  waiting on I/O) but then tend to stick around servicing
  moderate loads.
 
  -r

 Hello Roch,
 It's not my case. NFS stops to service after some point. And the
 reason is in ZFS. It never happens with NFS/UFS.
 Shortly, my scenario:
 1st SFS run, 2000 requested IOPS. NFS is fine, low number of threads.
 2nd SFS run, 4000 requested IOPS. NFS cannot serve all requests, no of
 threads jumps to max
 3rd SFS run, 2000 requested IOPS. NFS cannot serve all requests, no of
 threads jumps to max.
 System cannot get back to the same results under equal load (1st and 
   3rd).
 Reboot between 2nd and 3rd doesn't help. The only persistent thing is
 a directory structure that was created during the 2nd run (in SFS
 higher requested load - more directories/files created).
 I am sure it's a bug. I need help. I don't care that ZFS works N times
 worse than UFS. I really care that after heavy load everything is
 totally screwed.

 Thanks,
 -- Leon
  
   Hi Leon,
  
   How much is the slowdown between 1st and 3rd ? How filled is
  
  Typical case is:
  1st: 1996 IOPS, latency  2.7
  3rd: 1375 IOPS, latency 37.9
  

The large latency increase is the side effect of requesting
more than what can be delivered. Queues build up and latency
follows. So it should not be the primary focus IMO. The
decrease in IOPS is the primary problem.

One hypothesis is that over the life of the FS we're moving
toward spreading access over the full disk platter. We can
imagine some fragmentation hitting as well. I'm not sure
how I'd test both hypotheses.

   the pool at each stage ? What does 'NFS stops to service'
   mean ?
  
  There is a lot of error messages on the NFS(SFS) client :
  sfs352: too many failed RPC calls - 416 good 27 bad
  sfs3132: too many failed RPC calls - 302 good 27 bad
  sfs3109: too many failed RPC calls - 533 good 31 bad
  sfs353: too many failed RPC calls - 301 good 28 bad
  sfs3144: too many failed RPC calls - 305 good 25 bad
  sfs3121: too many failed RPC calls - 311 good 30 bad
  sfs370: too many failed RPC calls - 315 good 27 bad
 

Can this be timeouts or queue-full drops ? It might be a side 
effect of SFS requesting more than what can be delivered.

  Thanks,
  -- Leon

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Why number of NFS threads jumps to the max value?

2007-02-28 Thread Roch - PAE


http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6467988

NFSD  threads are created  on a  demand  spike (all of  them
waiting on I/O) but then tend to stick around servicing
moderate loads.

-r


  
  Leon Koll wrote:
   Hello, gurus
   I need your help. During the benchmark test of NFS-shared ZFS file systems 
   at some moment the number of NFS threads jumps to the maximal value, 1027 
   (NFSD_SERVERS was set to 1024). The latency also grows and the number of 
   IOPS is going down.
   I've collected the output of
   echo ::pgrep nfsd | ::walk thread | ::findstack -v | mdb -k
   that can be seen here:
   http://tinyurl.com/yrvn4z
  
   Could you please look at it and tell me what's wrong with my NFS server.
   Appreciate,
   -- Leon


   This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Efficiency when reading the same file blocks

2007-02-28 Thread Roch - PAE

Jeff Davis writes:
   On February 26, 2007 9:05:21 AM -0800 Jeff Davis
   But you have to be aware that logically sequential
   reads do not
   necessarily translate into physically sequential
   reads with zfs.  zfs
  
  I understand that the COW design can fragment files. I'm still trying to 
  understand how that would affect a database. It seems like that may be bad 
  for performance on single disks due to the seeking, but I would expect that 
  to be less significant when you have many spindles. I've read the following 
  blogs regarding the topic, but didn't find a lot of details:
  
  http://blogs.sun.com/bonwick/entry/zfs_block_allocation
  http://blogs.sun.com/realneel/entry/zfs_and_databases
   
   

Here is my take on this:

DB updates (writes) are mostly governed by the synchronous
write code path, which for ZFS means ZIL performance.
It's already quite good in that it aggregates multiple
updates into few I/Os.  Some further improvements are in the
works.  COW, in general, greatly simplifies the write code path.

DB reads in transactional workloads are mostly random.  If
the DB is not cacheable, the performance will be that of a
head seek no matter what FS is used (since we can't guess in
advance where to seek, the COW nature neither helps nor hinders
performance).

DB reads in decision-support workloads can benefit from good
prefetching (since here we actually know where the next
seeks will be).

-r

  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Efficiency when reading the same file blocks

2007-02-28 Thread Roch - PAE

Frank Hofmann writes:
  On Tue, 27 Feb 2007, Jeff Davis wrote:
  
  
   Given your question are you about to come back with a
   case where you are not
   seeing this?
  
  
   As a follow-up, I tested this on UFS and ZFS. UFS does very poorly: the 
   I/O rate drops off quickly when you add processes while reading the same 
   blocks from the same file at the same time. I don't know why this is, and 
   it would be helpful if someone explained it to me.
  
  UFS readahead isn't MT-aware - it starts trashing when multiple threads 
  perform reads of the same blocks. UFS readahead only works if it's a 
  single thread per file, as the readahead state, i_nextr, is per-inode 
  (and not a per-thread) state. Multiple concurrent readers trash this for 
  each other, as there's only one-per-file.
  

To qualify 'trashing': this means UFS loses track of the
access pattern, considers the workload random, and so does not do any
readahead.

  
   ZFS did a lot better. There did not appear to be any drop-off after the 
   first process. There was a drop in I/O rate as I kept adding processes, 
   but in that case the CPU was at 100%. I haven't had a chance to test this 
   on a bigger box, but I suspect ZFS is able to keep the sequential read 
   going at full speed (at least if the blocks happen to be written 
   sequentially).
  
  ZFS caches multiple readahead states - see the leading comment in
  usr/src/uts/common/fs/zfs/vdev_cache.c in your favourite workspace.
  

The vdev_cache is where you have the low-level, device-level prefetch (an I/O
for 8K reads 64K of whatever happens to be under the disk
head).

dmu_zfetch.c is where the logical prefetching occurs.


-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] understanding zfs/thunoer bottlenecks?

2007-02-27 Thread Roch - PAE

Jens Elkner writes:
  Currently I'm trying to figure out the best zfs layout for a thumper wrt. to 
  read AND write performance. 
  
  I did some simple mkfile 512G tests and found out that on average ~
  500 MB/s seems to be the maximum one can reach (tried the initial default
  setup, all 46 HDDs as R0, etc.). 
  

That might be a per pool limitation due to 

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6460622

This performance feature was fixed in Nevada last week.
The workaround is to create multiple pools with fewer disks.

Also this

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6415647

is degrading the perf a bit (guesstimate of anywhere up to
10-20%).

-r



  According to
  http://www.amd.com/us-en/assets/content_type/DownloadableAssets/ArchitectureWP_062806.pdf
  I would assume, that much more and at least in theory a max. ~ 2.5
  GB/s should be possible with R0 (assuming the throughput for a single
  thumper HDD is ~ 54 MB/s)... 
  
  Is somebody able to enlighten me?
  
  Thanx,
  jel.
   
   
  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Perforce on ZFS

2007-02-21 Thread Roch - PAE

So Jonathan, you have a concern about the on-disk space
efficiency for small files (more or less sub-sector). It is a
problem that we can throw rust at. I am not sure if this is
the basis of Claude's concern though.

Creating small files: last week I did a small test. With ZFS
I could create 4600 files _and_ sync up the pool to disk and
saw no more than 500 I/Os.  I'm no FS expert but this looks
absolutely amazing to me (ok, I'm rather enthusiastic in
general).  Logging UFS needs 1 I/O per file (so ~10X more for
my test).  I don't know where other filesystems are on that
metric.
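
Roughly the kind of test meant here (path and file count are illustrative;
watch the device activity with 'zpool iostat 1' or 'iostat -xn 1' in another
window while it runs):

    i=0
    while [ $i -lt 4600 ]; do
        echo hello > /tank/smallfiles/f$i
        i=`expr $i + 1`
    done
    sync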

I also pointed out that ZFS is not too CPU-efficient at tiny
write(2) syscalls. But this inefficiency disappears around 8K writes.
This here is a CPU benchmark (I/O is non-factor) :

CHUNK   ZFS vs UFS

1B  4X slower
1K  2X slower
8K  25% slower
32K equal
64K 30% faster

Waiting for a more specific problem statement, I can only
stick to what I said: I know of no small-file problems with
ZFS; if there is one, I'd just like to see the data.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] Re: Perforce on ZFS

2007-02-20 Thread Roch - PAE

Sorry to insist  but I am not  aware of a small file problem
with  ZFS (which doesn't mean there   isn't one, nor that we
agree on definition of 'problem'). So  if anyone has data on
this topic, I'm interested.

Also note, ZFS does a lot more than VxFS.

-r

Claude Teissedre writes:
  Hello Roch,
  
  Thanks for your reply. According to Iozone and Filebench 
  (http://blogs.sun.com/dom/), ZFS is less performant than VxFS for small 
  files and more performant for large files. In your blog, I don't see 
  specific info related to small files - but it's a very interesting blog.
  
  Any help from CC: people related to Perforce benchmark (not in 
  techtracker) is welcome.
  
  Thanks,
  Claude
  
  Roch - PAE a écrit :
   Salut Claude.
   For this kind of query, try zfs-discuss@opensolaris.org;
   Looks like a common workload to me.
   I know of no small file problem with ZFS.
   You might want to state your metric of success ?
  
   -r
  
   Claude Teissedre writes:
 Hello,
 
 I am looking for any benchmark of Perforce on ZFS.
  My need here is specifically for Perforce, a source manager. At my ISV, 
    it handles 250 users simultaneously (15 instances on average) 
  and 16 million (small) files. That's an area not covered in the 
    benchmarks I have seen.
 
 Thanks, Claude
 
 
  
 
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: SPEC SFS benchmark of NFS/ZFS/B56 - please help to improve it!

2007-02-19 Thread Roch - PAE

Leon Koll writes:
  An update:
  
  Not sure if it is related to fragmentation, but I can say that the serious 
  performance degradation in my NFS/ZFS benchmarks is a result of the on-disk ZFS 
  data layout.
  Read operations on directories (NFS3 readdirplus) are abnormally time 
  consuming. That kills the server. After a cold restart of the host the 
  performance is still on the floor. 
  My conclusion: it's not CPU, not memory, it's ZFS on-disk structures.
   
   
  This message posted from opensolaris.org


As I understand the issue, a readdirplus is
2X slower when data is already cached in the client than
when it is not.

Given that the on-disk structure does not change between the 
2 runs, I can't really place the fault on it.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Is ZFS file system supports short writes ?

2007-02-19 Thread Roch - PAE

dudekula mastan writes:
  If a write call attempted to write X bytes of data, and the write call writes 
  only x (where x < X) bytes, then we call that write a short write.
 
-Masthan

What kind of support do you want/need ?

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Not about Implementing fbarrier() on ZFS

2007-02-13 Thread Roch - PAE
Erblichs writes:
  Jeff Bonwick,
  
   Do you agree that there is a major tradeoff in
   building up a wad of transactions in memory?
  
   We lose the changes if we have an unstable
   environment.
  
   Thus, I don't quite understand why a 2-phase
   approach to commits isn't done. First, take the
   transactions as they come and do a minimal amount
   of delayed write. If the number of transactions
   builds up, then convert to the delayed write scheme.
  

I probably don't understand the  proposition.  It seems that
this is about making all writes synchronous and initially go
through the Zil and then convert  to the pool sync when load
builds up ?  The  problem is that if we  make all  writes go
through the synchronous Zil,   this   will limit  the   load
greatly in a way that we'll never build a backlog (unless we
scale to  1000s of threads). So is  this  about an option to
enable O_DSYNC for all files ?


   This assumption is that not all ZFS envs are write
   heavy versus write once and read-many type accesses.
   My assumption is that attribute/meta reading
   outweighs all other accesses.
   
   Wouldn't this approach allow minimal outstanding
   transactions and favor read access. Yes, the assumption
   is that once the wad is started, the amount of writing
   could be substantial and thus the amount of available
   bandwidth for reading is reduced. This would then allow
   for a more N states to be available. Right?

So the reads _are_ prioritized over pool writes by the I/O
scheduler. But it is correct that the pool sync does impact
the read latency, at least on JBOD.  There already are
suggestions on reducing the impact (reserved read slots,
throttling writers, ...).  Also, for the next build the
overhead of the pool sync is reduced, which opens up the
possibility of testing with smaller txg_time.

I would be interested to know the problems you have observed
to see if we're covered.

  
   Second, there are multiple uses of then (then pushes,
   then flushes all disk..., then writes the new uberblock,
   then flushes the caches again), which seems to have
   some level of possible parallelism that should reduce the
   latency from the start to the final write. Or did you just
   say that for simplicity's sake?
  

The parallelism level of those operations seems very high to
me and it was improved last week (for  the tail end of the
pool sync). But note that the pool sync does not commonly
hold up a write or a zil commit. It does so only when the
storage is saturated for 10s of seconds. Given that memory
is finite we have to throttle applications at some point.


-r

   Mitchell Erblich
   ---
   
  
  Jeff Bonwick wrote:
   
   Toby Thain wrote:
I'm no guru, but would not ZFS already require strict ordering for its
transactions ... which property Peter was exploiting to get fbarrier()
for free?
   
   Exactly.  Even if you disable the intent log, the transactional nature
   of ZFS ensures preservation of event ordering.  Note that disk caches
   don't come into it: ZFS builds up a wad of transactions in memory,
   then pushes them out as a transaction group.  That entire group will
   either commit or not.  ZFS writes all the new data to new locations,
   then flushes all disk write caches, then writes the new uberblock,
   then flushes the caches again.  Thus you can lose power at any point
   in the middle of committing transaction group N, and you're guaranteed
   that upon reboot, everything will either be at state N or state N-1.
   
   I agree about the usefulness of fbarrier() vs. fsync(), BTW.  The cool
   thing is that on ZFS, fbarrier() is a no-op.  It's implicit after
   every system call.
   
   Jeff

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Implementing fbarrier() on ZFS

2007-02-13 Thread Roch - PAE
Peter Schuller writes:
   I agree about the usefulness of fbarrier() vs. fsync(), BTW.  The cool
   thing is that on ZFS, fbarrier() is a no-op.  It's implicit after
   every system call.
  
  That is interesting. Could this account for disproportionate kernel
  CPU usage for applications that perform I/O one byte at a time, as
  compared to other filesystems? (Nevermind that the application
  shouldn't do that to begin with.)

I just quickly measured this (overwriting files in CHUNKS);
This is a software benchmark (I/O is a non-factor)

CHUNK   ZFS vs UFS

1B  4X slower
1K  2X slower
8K  25% slower
32K equal
64K 30% faster

Quick and dirty but I think it paints a picture.
I can't really answer your question though.
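
A rough way to reproduce this kind of comparison (file paths are
placeholders, and dd's count should be scaled so the total bytes written
stay constant across chunk sizes):

    ptime dd if=/dev/zero of=/tank/fs/chunktest bs=1k  count=131072    # 128 MB in 1K writes
    ptime dd if=/dev/zero of=/ufs/chunktest     bs=1k  count=131072    # same on UFS
    ptime dd if=/dev/zero of=/tank/fs/chunktest bs=64k count=2048      # same 128 MB in 64K writes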

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: ZFS vs NFS vs array caches, revisited

2007-02-13 Thread Roch - PAE

The only obvious thing would be if the exported ZFS
filesystems were initially mounted at a point in time when
zil_disable was non-zero.

The stack trace that is relevant is:

  sd_send_scsi_SYNCHRONIZE_CACHE
  sd`sdioctl+0x1770
  zfs`vdev_disk_io_start+0xa0
  zfs`zil_flush_vdevs+0x108
  zfs`zil_commit_writer+0x2b8
...

You might want to try in turn:

dtrace -n 'sd_send_scsi_SYNCHRONIZE_CACHE:entry{@[stack(20)]=count()}'

dtrace -n 'sdioctl:entry{@[stack(20)]=count()}'

dtrace -n 'zil_flush_vdevs:entry{@[stack(20)]=count()}'

dtrace -n 'zil_commit_writer:entry{@[stack(20)]=count()}'

And see if you lose your footing along the way.


-r


Marion Hakanson writes:
  [EMAIL PROTECTED] said:
   How did ZFS striped on 7 slices of an FC-SATA LUN via NFS work 146 times
   faster than ZFS on 1 slice of the same LUN via NFS???
  
  Well, I do have more info to share on this issue, though how it worked
  faster in that test still remains a mystery.  Folks may recall that I said:
  
   Not that I'm complaining, mind you.  I appear to have stumbled across a
   way to get NFS over ZFS to work at a reasonable speed, without making 
   changes
   to the array (nor resorting to giving ZFS SVN soft partitions instead of
   real devices).  Suboptimal, mind you, but it's workable if our Hitachi
   folks don't turn up a way to tweak the array.
  
  Unfortunately, I was wrong.  I _don't_ know how to make it go fast.  While
  I _have_ been able to reproduce the result on a couple different LUN/slice
  configurations, I don't know what triggers the fast behavior.  All I can
  say for sure is that a little dtrace one-liner that counts sync-cache calls
  turns up no such calls (for both local ZFS and remote NFS extracts) when
  things are going fast on a particular filesystem.
  
  By comparison, a local ZFS tar-extraction triggers 12 sync-cache calls,
  and one hits 288 such calls during an NFS extraction before interrupting
  the run after 30 seconds (est. 1/100th of the way through) when things
  are working in the slow mode.  Oh yeah, here's the one-liner (type in
  the command, run your test in another session, then hit ^C on this one):
  
dtrace -n fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry'[EMAIL PROTECTED] = 
  count()}'
  
  This is my first ever use of dtrace, so please be gentle with me (:-).
  
  
  [EMAIL PROTECTED] said:
   Guess I should go read the ZFS source code (though my 10U3 surely lags the
   Opensolaris stuff). 
  
  I did go read the source code, for my own edification.  To reiterate what
  was said earlier:
  
  [EMAIL PROTECTED] said:
   The point is that the flushes occur whether or not ZFS turned the caches
   on (caches might be turned on by some other means outside the visibility
   of ZFS).
  
  My limited reading of the ZFS code (on the opensolaris.org site) so far has
  turned up no obvious way to make ZFS skip the sync-cache call.  However my
  dtrace test, unless it's flawed, shows that on some filesystems the call is
  made, and on other filesystems it is not.
  
  
  [EMAIL PROTECTED] said:
   2. I never saw a storage controller with a per-LUN cache setting. Cache
   size doesn't depend on the number of LUNs IMHO; it's a fixed size per
   controller or per FC port. SAN experts, please fix me if I'm wrong.
  
  Robert has already mentioned array cache being reserved on a per-LUN basis
  in Symmetrix boxes.  Our low-end HDS unit also has cache pre-fetch settings
  on a per-LUN basis (defaults according to number of disks in RAID-group).
  
  Regards,
  
  Marion
  
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Re: ZFS vs NFS vs array caches, revisited

2007-02-13 Thread Roch - PAE

On x86 try with sd_send_scsi_SYNCHRONIZE_CACHE
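
In other words, something like this (a guess at the aggregation the
archive scrubbed out; it simply counts the sync-cache calls):

  dtrace -n 'fbt::sd_send_scsi_SYNCHRONIZE_CACHE:entry{@ = count()}'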

Leon Koll writes:
  Hi Marion,
  your one-liner works only on SPARC and doesn't work on x86: 
  # dtrace -n fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry'[EMAIL PROTECTED] = 
  count()}'
  dtrace: invalid probe specifier fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:[EMAIL 
  PROTECTED] = count()}: probe description 
  fbt::ssd_send_scsi_SYNCHRONIZE_CACHE:entry does not match any probes
  
  What's wrong with it?
  Thanks,
  -- leon
   
   
  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?

2007-02-12 Thread Roch - PAE

Robert Milkowski writes:
  bash-3.00# dtrace -n 'fbt::txg_quiesce:return{printf("%Y ", walltimestamp);}'
  dtrace: description 'fbt::txg_quiesce:return' matched 1 probe
  CPU IDFUNCTION:NAME
3  38168   txg_quiesce:return 2007 Feb 12 14:08:15 
0  38168   txg_quiesce:return 2007 Feb 12 14:12:14 
3  38168   txg_quiesce:return 2007 Feb 12 14:15:05 
  ^C
  
  
  
  Why do I not see it exactly every 5s?
  On another server I get output exactly every 5s.
   
   

I am not sure about this specific function, but if the
question is the same as why the pool is syncing more often
than every 5 seconds, then that can be because of a low-memory
condition (if we have too much dirty memory to sync, we don't
wait the full 5 seconds). See arc_tempreserve_space around (ERESTART).

-r
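
A quick way to check whether that path is involved (a sketch; it assumes
the fbt probe for arc_tempreserve_space is available and that arg1 in the
return probe carries the error return value):

  dtrace -n 'fbt::arc_tempreserve_space:return/arg1 != 0/{@[arg1] = count()}'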




  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] Re: NFS/ZFS performance problems - txg_wait_open() deadlocks?

2007-02-12 Thread Roch - PAE

Duh!

Long syncs (which delay the next sync) are also possible on
write-intensive workloads. Throttling heavy writers, I
think, is the key to fixing this.
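
A sketch for seeing how long each sync actually takes (assumes the
fbt::spa_sync probes are available on the build in question):

  dtrace -n 'fbt::spa_sync:entry{self->t = timestamp} fbt::spa_sync:return/self->t/{@["spa_sync (ms)"] = quantize((timestamp - self->t) / 1000000); self->t = 0;}'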

Robert Milkowski writes:
  Hello Roch,
  
  Monday, February 12, 2007, 3:19:23 PM, you wrote:
  
  RP Robert Milkowski writes:
    bash-3.00# dtrace -n 'fbt::txg_quiesce:return{printf("%Y ", walltimestamp);}'
dtrace: description 'fbt::txg_quiesce:return' matched 1 probe
CPU IDFUNCTION:NAME
  3  38168   txg_quiesce:return 2007 Feb 12 14:08:15 
  0  38168   txg_quiesce:return 2007 Feb 12 14:12:14 
  3  38168   txg_quiesce:return 2007 Feb 12 14:15:05 
^C



    Why do I not see it exactly every 5s?
    On another server I get output exactly every 5s.
 
 
  
  RP I am not sure about this specific function, but if the
  RP question is the same as why the pool is syncing more often
  RP than every 5 seconds, then that can be because of a low-memory
  RP condition (if we have too much dirty memory to sync, we don't
  RP wait the full 5 seconds). See arc_tempreserve_space around (ERESTART).
  
  The opposite - why it's not syncing every 5s but rather every
  few minutes on that server.
  
  
  -- 
  Best regards,
   Robertmailto:[EMAIL PROTECTED]
 http://milek.blogspot.com
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


[zfs-discuss] RealNeel : ZFS and DB performance

2007-02-09 Thread Roch - PAE

It's just a matter of time before ZFS overtakes UFS/DIO
for DB loads. See Neel's new blog entry:

http://blogs.sun.com/realneel/entry/zfs_and_databases_time_for

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: Re[2]: [zfs-discuss] se3510 and ZFS

2007-02-07 Thread Roch - PAE

Robert Milkowski writes:
  Hello Jonathan,
  
  Tuesday, February 6, 2007, 5:00:07 PM, you wrote:
  
  JE On Feb 6, 2007, at 06:55, Robert Milkowski wrote:
  
   Hello zfs-discuss,
  
  It looks like when zfs issues write cache flush commands, the se3510
  actually honors them. I do not have a spare se3510 right now to be 100%
  sure, but comparing an nfs/zfs server with an se3510 to another nfs/ufs
  server with an se3510 (with Periodic Cache Flush Time set to disable
  or a longer time), I can see that cache utilization on nfs/ufs stays
  at about 48%, while on nfs/zfs it hardly reaches 20% and every few
  seconds goes down to 0 (I guess every txg_time).
  
 nfs/zfs also has worse performance than nfs/ufs.
  
 Does anybody know how to tell se3510 not to honor write cache flush
 commands?
  
  JE I don't think you can .. DKIOCFLUSHWRITECACHE *should* tell the array
  JE to flush the cache.  Gauging from the number of calls that zfs makes to
  JE this vs ufs (fsck, lockfs, mount?) - I think you'll see the performance
  JE diff, particularly when you hit an NFS COMMIT.  (If you don't use vdevs
  JE you may see another difference in zfs, as the only place you'll hit is
  JE on the zil.)
  
  IMHO it definitely shouldn't, actually. The array has two controllers
  and the write cache is mirrored. Also, this is not the only host using
  that array. You could actually gain a lot of performance, especially with
  an nfs/zfs setup (lots of synchronous ops), I guess.
  

Again, it's a question of semantics. The intent of ZFS is to
say "put the bits on stable storage". The controller then
decides whether the NVRAM qualifies as stable storage (is dual ported,
batteries are up) and can ignore the request. If the battery
charge gets low, then the array needs to honor the flush request.
There is no way for ZFS to adjust to the array's battery charge.

So I'd argue that DKIOCFLUSHWRITECACHE is misnamed. The work
going on is to allow DKIOCFLUSHWRITECACHE to be qualified to
mean either "flush to rust" (which I guess won't be used by ZFS)
or "flush to stable storage"; if the NVRAM is considered stable
enough by the array, then it will turn the request into a no-op.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Actual (cache) memory use of ZFS?

2007-01-30 Thread Roch - PAE

Bjorn Munch writes:
  Hello,
  
  I am doing some tests using ZFS for the data files of a database
  system, and ran into memory problems which has been discussed in a
  thread here a few weeks ago.
  
  When creating a new database, the data files are first initialized to
  their configured size (written in full), then the servers are started.
  They will then need to allocate shared memory for database cache.  I
  am running two database nodes per host, trying to use 512 MB of memory
  each.
  
  They are using so-called Intimate Shared Memory which requires that
  the requested amount is available in physical memory.  Since ZFS has
  just gobbled up memory for cache, it is not available and the database
  won't start.
  
  This was on a host with 2 GB of memory.
  

That seems like a bug. ZFS is designed to release memory
upon demand by the DB. Which OS was this running?
It could be related to:

MrNumber: 4034947
Synopsis: anon_swap_adjust(),  anon_resvmem() should
  call kmem_reap() if availrmem is low.
Fixed in snv_42

-r
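
If the reap isn't happening fast enough, one workaround (hedged: only on
builds that expose the zfs_arc_max tunable) is to cap the ARC in
/etc/system so the ISM segments always have headroom, e.g.:

  * example only: cap the ARC at 512 MB
  set zfs:zfs_arc_max = 0x20000000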

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Thumper Origins Q

2007-01-30 Thread Roch - PAE

Nicolas Williams writes:
  On Tue, Jan 30, 2007 at 06:32:14PM +0100, Roch - PAE wrote:
 The only benefit of using a HW RAID controller with ZFS is that it
 reduces the I/O that the host needs to do, but the trade off is that ZFS
 cannot do combinatorial parity reconstruction so that it could only
 detect errors, not correct them.  It would be cool if the host could
 offload the RAID I/O to a HW controller but still be able to read the
 individual stripes to perform combinatorial parity reconstruction.
   
   right but in this situation, if the cosmic ray / bit flip hits on the
   way to the controller, the array will store wrong data and
   we will not be able to reconstruct the correct block.
   
   So having multiple I/Os here improves the time to data
   loss metric.
  
  You missed my point.  Assume _new_ RAID HW that allows the host to read
  the individual stripes.  The ZFS could offload I/O to the RAID HW but,
  when a checksum fails to validate on read, THEN go read the individual
  stripes and parity and do the combinatorial reconstruction as if the
  RAID HW didn't exist.
  
  I don't believe such RAID HW exists, therefore the point is moot.  But
  if such HW ever comes along...
  
  Nico
  -- 

I think I got the point. Mine was that if the data travels a 
single time toward the storage and is corrupted along the
way then there will be no hope of recovering it since the
array was given bad data. Having the data travel twice is a
benefit for MTTDL.

-r

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: ZFS or UFS - what to do?

2007-01-29 Thread Roch - PAE
Anantha N. Srirama writes:
  Agreed, I guess I didn't articulate my point/thought very well. The
  best config is to present JBoDs and let ZFS provide the data
  protection. This has been a very stimulating conversation thread; it
  is shedding new light into how to best use ZFS. 
   
   

I would say:

        To enable the unique ZFS feature of self-healing,
        ZFS must be allowed to manage a level of
        redundancy: mirroring or RAID-Z.

The type of LUNs (JBOD/RAID-*/iSCSI) used is not
relevant to this statement.

Now, if one also relies on ZFS to reconstruct data in the
face of disk failures (as opposed to storage-based
reconstruction), better make sure that single/double disk
failures do not bring down multiple LUNs at once. So better
protection is achieved by configuring LUNs that map to
segregated sets of physical things (disks and controllers).
-r
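
For example (illustrative device names), a pool whose redundancy spans
two controllers survives the loss of a disk or a whole controller:

  zpool create tank mirror c0t0d0 c1t0d0 mirror c0t1d0 c1t1d0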

  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-24 Thread Roch - PAE
[EMAIL PROTECTED] writes:
   Note also that for most applications, the size of their IO operations
   would often not match the current page size of the buffer, causing
   additional performance and scalability issues.
  
  Thanks for mentioning this, I forgot about it.
  
  Since ZFS's default block size is configured to be larger than a page,
  the application would have to issue page-aligned block-sized I/Os.
  Anyone adjusting the block size would presumably be responsible for
  ensuring that the new size is a multiple of the page size.  (If they
  would want Direct I/O to work...)
  
  I believe UFS also has a similar requirement, but I've been wrong
  before.
  

I believe the UFS requirement is that the I/O be sector
aligned for DIO to be attempted. And Anton did mention that
one of the benefits of DIO is the ability to direct-read a
subpage block. Without UFS/DIO the OS is required to read and
cache the full page, and the extra amount of I/O may lead to
data channel saturation (I don't see latency as an issue
here, right?).

This is where I said that such a feature would translate
for ZFS into the ability to read parts of a filesystem block,
which would only make sense if checksums are disabled.

And for RAID-Z that could mean avoiding I/Os to every disk but
one in a group, so that's a nice benefit.

So for the performance-minded customer that can't afford
mirroring, is not much of a fan of data integrity, and needs
to do subblock reads against an uncacheable workload, I can
see such a feature popping up. And this feature is independent
of whether or not the data is DMA'ed straight into the user
buffer.

The other feature is to avoid a bcopy by DMAing full
filesystem block reads straight into the user buffer (and verifying
the checksum after). The I/O is high latency; the bcopy adds a small
amount. The kernel memory can be freed/reused straight after
the user read completes. This is where I ask: how much CPU
is lost to the bcopy in workloads that benefit from DIO?

At this point, there are lots of projects that will lead to
performance improvements. The DIO benefits seem like small
change in the context of ZFS.

The quickest return on investment I see for the directio
hint would be to tell ZFS not to grow the ARC when servicing
such requests.


-r



  -j
  

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Re: Heavy writes freezing system

2007-01-18 Thread Roch - PAE

If some aspect of the load is writing large amounts of data
into the pool (through the memory cache, as opposed to the
zil) and that leads to a frozen system, I think that a
possible contributor is:

6429205 each zpool needs to monitor its throughput and throttle heavy writers

-r

Anantha N. Srirama writes:
  Bug 6413510 is the root cause. ZFS maestros please correct me if I'm quoting 
  an incorrect bug.
   
   
  This message posted from opensolaris.org

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: Heavy writes freezing system

2007-01-18 Thread Roch - PAE

Jason J. W. Williams writes:
  Hi Anantha,
  
  I was curious why segregating at the FS level would provide adequate
  I/O isolation. Since all the FSes are on the same pool, I assumed flogging
  one FS would flog the pool and negatively affect all the other FSes on
  that pool?
  
  Best Regards,
  Jason
  

Good point. If the problem is

6413510 zfs: writing to ZFS filesystem slows down fsync() on other files

then segregation into 2 filesystems on the same pool will
help.

But if the problem is more like

6429205 each zpool needs to monitor its throughput and throttle heavy writers

then 2 FSes won't help; 2 pools probably would, though.

-r
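
Roughly (illustrative pool and device names):

  # two filesystems, one pool -- they still share the pool's sync behavior
  zfs create tank/db
  zfs create tank/scratch

  # a second pool -- a heavy writer there no longer drags down the first
  zpool create scratchpool mirror c2t0d0 c3t0d0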


___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] Re: ZFS direct IO

2007-01-15 Thread Roch - PAE

Jonathan Edwards writes:
  
  On Jan 5, 2007, at 11:10, Anton B. Rang wrote:
  
   DIRECT IO is a set of performance optimisations to circumvent  
   shortcomings of a given filesystem.
  
   Direct I/O as generally understood (i.e. not UFS-specific) is an  
   optimization which allows data to be transferred directly between  
   user data buffers and disk, without a memory-to-memory copy.
  
   This isn't related to a particular file system.
  
  
  true .. directio(3) is generally used in the context of *any* given  
  filesystem to advise it that an application buffer to system buffer  
  copy may get in the way or add additional overhead (particularly if  
  the filesystem buffer is doing additional copies.)  You can also look  
  at it as a way of reducing more layers of indirection particularly if  
  I want the application overhead to be higher than the subsystem  
  overhead.  Programmatically .. less is more.

Direct IO makes good sense when the target disk sectors are
set a priori. But in the context of ZFS, would you rather
have 10 direct disk I/Os, or 10 bcopies and 2 I/Os (say that
were possible)?

As for reads, I can see that when the load is cached in the
disk array and we're running at 100% CPU, the extra copy might
be noticeable. Is this the situation that longs for DIO?
What % of a system is spent in the copy? What is the added
latency that comes from the copy? Is DIO the best way to
reduce the CPU cost of ZFS?

The current Nevada code base has quite nice performance
characteristics (and certainly quirks); there are many
further efficiency gains to be reaped from ZFS. I just don't
see DIO at the top of that list for now. Or at least someone
needs to spell out what ZFS/DIO is and how much better it
is expected to be (back-of-the-envelope calculations accepted).

Reading RAID-Z subblocks on filesystems that have checksums
disabled might be interesting. That would avoid some disk
seeks. Whether to serve the subblocks directly or not is a
separate matter; it's a small deal compared to the feature
itself. How about disabling the DB checksum (it can't fix
the block anyway) and doing mirroring?

-r



___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


Re: [zfs-discuss] NFS and ZFS, a fine combination

2007-01-09 Thread Roch - PAE

Dennis Clarke writes:
  
   On Mon, Jan 08, 2007 at 03:47:31PM +0100, Peter Schuller wrote:
  http://blogs.sun.com/roch/entry/nfs_and_zfs_a_fine
  
   So just to confirm; disabling the zil *ONLY* breaks the semantics of
   fsync()
   and synchronous writes from the application perspective; it will do
   *NOTHING*
   to lessen the correctness guarantee of ZFS itself, including in the case
   of a
   power outtage?
  
   That is correct.  ZFS, with or without the ZIL, will *always* maintain
   consistent on-disk state and will *always* preserve the ordering of
   events on-disk.  That is, if an application makes two changes to the
   filesystem, first A, then B, ZFS will *never* show B on-disk without
   also showing A.
  
  
So then, this begs the question: "Why do I want this ZIL animal at all?"
  

You said "correctness guarantee".
Bill said "...consistent on-disk state".


The ZIL is not necessary for ZFS to keep its on-disk format
consistent. However, the ZIL is necessary/essential to
provide synchronous semantics to applications. Without a ZIL,
fsync() and the like become no-ops; it's a very uncommon
requirement although one that does exist. But for ZFS to be
a correct filesystem, the ZIL is necessary and provides an
excellent service.

My article shows that ZFS with the ZIL can be better than
UFS (which uses its own logging scheme).

-r
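
For reference, the switch being discussed is the (unsupported) zil_disable
tunable. A hedged sketch of how people have toggled it, with the caveat
that it only takes effect for filesystems mounted afterwards:

  echo zil_disable/W0t1 | mdb -kw    # 1 = disable, 0t0 = re-enable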

___
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

