Re: [zfs-discuss] benefits of zfs root over ufs root
On 03/31/10 17:53, Erik Trimble wrote:
> Brett wrote:
>> Hi Folks, I'm in a shop that's very resistant to change. The management here
>> are looking for major justification of a move away from ufs to zfs for root
>> file systems. Does anyone know if there are any whitepapers/blogs/discussions
>> extolling the benefits of ZFS root over UFS root? Regards in advance, Rep
>
> I can't give you any links, but here's a short list of advantages:
>
> (1) all the standard ZFS advantages over UFS
> (2) LiveUpgrade/beadm related improvements:
>     (a) much faster on ZFS
>     (b) no dedicated slice needed per OS instance, so it's far simpler to have
>         N different OS installs
>     (c) very easy to keep track of which OS instance is installed where WITHOUT
>         having to mount each one
>     (d) huge space savings (snapshots save lots of space on upgrades)
> (3) much more flexible swap space allocation (no hard-boundary slices)
> (4) simpler layout of filesystem partitions, and more flexibility in changing
>     directory size limits (e.g. /var)
> (5) mirroring a boot disk is simple under ZFS - much more complex under SVM/UFS
> (6) root-pool snapshots make backups trivially easy

ZFS root will be the supported root filesystem for Solaris Next; we've been
using it for OpenSolaris for a couple of years.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
bart.smaald...@oracle.com       http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
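As a sketch of points (2) and (6) above, the workflows boil down to a few
commands (the BE, pool, and host names here are placeholders, not from the
original post):

    # beadm create testbe                # clone the running boot environment
    # beadm activate testbe              # boot into the clone on next reboot
    # beadm list                         # see every OS instance without mounting it
    # zfs snapshot -r rpool@backup       # snapshot the entire root pool
    # zfs send -R rpool@backup | ssh backuphost 'cat > rpool.zsend'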
Re: [zfs-discuss] zfs diff
On 03/29/10 16:44, Mike Gerdts wrote:
> On Mon, Mar 29, 2010 at 5:39 PM, Nicolas Williams nicolas.willi...@sun.com wrote:
>> One really good use for zfs diff would be: as a way to index zfs send
>> backups by contents.
>
> Or to generate the list of files for incremental backups via NetBackup or
> similar. This is especially important for file systems with millions of
> files and relatively few changes. Or to, say, keep indexing files on your
> desktop.

This gives everyone a way to access the changes in a filesystem in
O(number of files changed) instead of O(number of files extant).

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
bart.smaald...@oracle.com       http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
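For reference, the usage under discussion looks roughly like this (snapshot
and path names are hypothetical; the output format is illustrative - M for
modified, + added, - removed, R renamed):

    # zfs diff tank/home@monday tank/home@tuesday
    M       /tank/home/jeff/report.txt
    +       /tank/home/jeff/scratch/new.c
    -       /tank/home/jeff/old.log
    R       /tank/home/jeff/a.txt -> /tank/home/jeff/b.txt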
Re: [zfs-discuss] snv_133 - high cpu
On 02/23/10 15:20, Chris Ridd wrote:
> On 23 Feb 2010, at 19:53, Bruno Sousa wrote:
>> The system becomes really slow during the data copy using the network, but
>> if I copy data between 2 pools on the box I don't notice that issue, so I
>> may be hitting some sort of interrupt conflict in the network cards... This
>> system is configured with a lot of interfaces: 4 internal Broadcom gigabit;
>> 1 PCIe 4x Intel Dual Pro gigabit; 1 PCIe 4x Intel 10GbE card; 2 PCIe 8x Sun
>> non-RAID HBAs. With all of this, is there any way to check if there is
>> indeed an interrupt conflict or some other type of conflict that leads to
>> this high load? I also noticed some messages about ACPI... can ACPI also
>> affect the performance of the system?
>
> To see what interrupts are being shared:
>
> # echo "::interrupts -d" | mdb -k
>
> Running intrstat might also be interesting.
>
> Cheers, Chris

Is this using the mpt driver? There's an issue w/ the fix for 6863127 that
causes performance problems on larger memory machines, filed as 6908360.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
ba...@cyber.eng.sun.com         http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
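intrstat(1M) shows per-CPU time spent in each device's interrupt handler, so
a shared-interrupt hot spot stands out immediately. A run looks roughly like
this (device names and figures are illustrative, not from Bruno's box):

    # intrstat 5
          device |      cpu0 %tim      cpu1 %tim
    -------------+------------------------------
           bnx#0 |       234  1.2         0  0.0
          nxge#0 |     18755 42.7         3  0.1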
Re: [zfs-discuss] snv_133 - high cpu
On 02/24/10 12:57, Bruno Sousa wrote:
> Yes, I'm using the mpt driver. In total this system has 3 HBAs: 1 internal
> (Dell PERC) and 2 Sun non-RAID HBAs. I'm also using multipath, but if I
> disable multipath I have pretty much the same results..
>
> Bruno

From what I understand, the fix is expected very soon; your performance is
getting killed by the over-aggressive use of bounce buffers...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
ba...@cyber.eng.sun.com         http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] zfs related google summer of code ideas - your vote
> I would really like to see a feature like 'zfs diff f...@snap1 f...@othersnap'
> that would report the paths of files that have either been added, deleted,
> or changed between snapshots. If this could be done at the ZFS level instead
> of the application level it would be very cool.

AFAIK, this is being actively developed, w/ a prototype working...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
ba...@cyber.eng.sun.com         http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] Pause Solaris with ZFS compression busy by doing a cp?
Neil Perrin wrote:
>> I also noticed (perhaps by design) that a copy with compression off almost
>> instantly returns, but the writes continue LONG after the cp process claims
>> to be done. Is this normal?
>
> Yes, this is normal. Unless the application is doing synchronous writes
> (e.g. a DB) the file will be written to disk at the convenience of the FS.
> Most filesystems operate this way. It's too expensive to synchronously write
> out data, so it's batched up and written asynchronously.
>
>> Wouldn't closing the file ensure it was written to disk?
>
> No.
>
>> Is that tunable somewhere?
>
> No. For ZFS you can use sync(1M), which will force out all transactions for
> all files in the pool. That is expensive though.
>
> Neil.

Your application can call f[d]sync when it's done writing the file and before
it does the close if it wants all the data on disk. This has been standard
operating procedure for many, many years. From TFMP:

DESCRIPTION
    The fsync() function moves all modified data and attributes of the file
    descriptor fildes to a storage device. When fsync() returns, all in-memory
    modified copies of buffers associated with fildes have been written to the
    physical medium. The fsync() function is different from sync(), which
    schedules disk I/O for all files but returns before the I/O completes. The
    fsync() function forces all outstanding data operations to synchronized
    file integrity completion (see fcntl.h(3HEAD) definition of O_SYNC.)
    ...
USAGE
    The fsync() function should be used by applications that require that a
    file be in a known state. For example, an application that contains a
    simple transaction facility might use fsync() to ensure that all changes
    to a file or files caused by a given transaction were recorded on a
    storage medium.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
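A minimal C sketch of the write-then-fsync-then-close pattern described above
(the filename is a placeholder; error handling is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
            const char *buf = "important data\n";
            int fd = open("datafile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

            if (fd < 0) {
                    perror("open");
                    exit(1);
            }
            if (write(fd, buf, strlen(buf)) < 0) {
                    perror("write");
                    exit(1);
            }
            /* Force data and attributes to stable storage before close. */
            if (fsync(fd) < 0) {
                    perror("fsync");
                    exit(1);
            }
            (void) close(fd);
            return (0);
    }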
Re: [zfs-discuss] Issue with simultaneous IO to lots of ZFS pools
Chris Siebenmann wrote:
> | There are two issues here. One is the number of pools, but the other
> | is the small amount of RAM in the server. To be honest, most laptops
> | today come with 2 GBytes, and most servers are in the 8-16 GByte range
> | (hmmm... I suppose I could look up the average size we sell...)
>
> Speaking as a sysadmin (and a Sun customer), why on earth would I have to
> provision 8 GB+ of RAM on my NFS fileservers? I would much rather have that
> memory in the NFS client machines, where it can actually be put to work by
> user programs.

This depends entirely on the amount of disk & CPU on the fileserver... A
Thumper w/ 48 TB of disk and two dual-core CPUs is prob. somewhat
under-provisioned w/ 8 GB of RAM.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] Repairing known bad disk blocks before zfs encounters them
Richard Elling wrote:
> David wrote:
>> I have some code that implements background media scanning, so I am able to
>> detect bad blocks well before zfs encounters them. I need a script or
>> something that will map the known bad block(s) to a logical block so I can
>> force zfs to repair the bad block from redundant/parity data. I can't find
>> anything that isn't part of a draconian scanning/repair mechanism. Granted,
>> the zfs architecture can map physical block X to logical block Y, Z, and
>> other letters of the alphabet... but I want to go backwards. 2nd part of
>> the question: assuming I know /dev/dsk/c0t0d0 has an ECC error on block n,
>> and I now have the appropriate storage pool info offset that corresponds to
>> that block, how do I force the file system to repair the offending block?
>> This was easy to address in Linux, assuming the filesystem was built on the
>> /dev/md driver, because all I had to do was force a read and twiddle with
>> the parameters to force a non-cached I/O and subsequent repair.
>
> Just read it.
> -- richard
>
>> It seems as if ZFS is too smart for its own good and won't let me fix
>> something that I know is bad before ZFS has a chance to discover it for
>> itself. :)

I think what the OP was saying is that he somehow knows that an unallocated
block on the disk is bad, and he'd like to tell ZFS about it ahead of time.
But "repair" implies there's data to read on the disk; ZFS won't read disk
blocks it didn't write.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] ZFS performance lower than expected
Bart Van Assche wrote:
> Hello,
>
> I just made a setup in our lab which should make ZFS fly, but unfortunately
> performance is significantly lower than expected: for large sequential data
> transfers write speed is about 50 MB/s, while I was expecting at least
> 150 MB/s.
>
> Setup
> -----
> The setup consists of five servers in total: one OpenSolaris ZFS server and
> four SAN servers. ZFS accesses the SAN servers via iSCSI and IPoIB.
>
> * ZFS server: OpenSolaris build 78; two Intel Xeon CPUs, eight cores in
>   total; 16 GB RAM; local disks not relevant for this test.
>
> * SAN servers: Linux 2.6.22.18 kernel, 64-bit, plus iSCSI Enterprise Target
>   (IET), configured to perform both read and write caching. Intel Xeon E5310
>   CPU, 1.60 GHz, four cores in total. RAM: two servers with 8 GB, one with
>   4 GB, one with 2 GB. Disks: 16 in total - two with the Linux OS and 14 set
>   up in RAID-0 via LVM. The LVM volume is exported via iSCSI and used by ZFS.
>   These SAN servers give excellent performance results when accessed via
>   Linux's open-iscsi initiator.
>
> * Network: 4x SDR InfiniBand; the raw transfer speed of this network is
>   8 Gbit/s. Netperf reports 1.6 Gbit/s between the ZFS server and one SAN
>   server (IPoIB, single-threaded). iSCSI transfer speed between the ZFS
>   server and one SAN server is about 150 MB/s.
>
> Performance test
> ----------------
> Software: xdd (see also http://www.ioperformance.com/products.htm). I
> modified xdd such that the -dio command line option enables O_RSYNC and
> O_DSYNC in open() instead of calling directio(). Test command:
>
>   xdd -verbose -processlock -dio -op write -targets 1 testfile -reqsize 1 \
>       -blocksize $((2**20)) -mbytes 1000 -passes 3
>
> This test command triggers synchronous writes with a block size of 1 MB
> (verified with truss). I am using synchronous writes because these give the
> same performance results as very large buffered writes (large compared to
> ZFS' cache size). Write performance reported by xdd for synchronous
> sequential writes: 50 MB/s, which is lower than expected.
>
> Any help with improving the performance of this setup is highly appreciated.
>
> Bart Van Assche.

If I understand this correctly, you've striped the disks together w/ Linux
LVM, then exported a single iSCSI volume to ZFS (or two for mirroring; which
isn't clear). I don't know how many concurrent IOs Solaris thinks your iSCSI
volumes will handle, but that's one area to examine. The only way to realize
full performance is going to be to get ZFS to issue multiple IOs to the iSCSI
boxes at once.

I'd also suggest just exporting the raw disks to ZFS, and having it do the
striping. On 4 commodity 500 GB SATA drives set up w/ RAID-Z, my 2.6 GHz
dual-core AMD box sustains 100+ MB/sec read or write; it happily saturates a
GbE NIC w/ multiple concurrent reads over Samba. W/ 16 drives direct attached
you should see close to 500 MB/sec sustained IO throughput.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
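If the SAN servers exported their 14 data disks individually, handing the
striping to ZFS might look something like this (the device names are
placeholders for the iSCSI LUNs as they appear on the ZFS server):

    # zpool create tank \
          raidz c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 \
          raidz c3t0d0 c3t1d0 c3t2d0 c3t3d0 c3t4d0 c3t5d0 c3t6d0
    # zpool iostat -v tank 5      # watch per-vdev throughput during the test

With the LUNs visible as separate vdevs, ZFS can keep many IOs outstanding to
each SAN box at once instead of funneling everything through one queue.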
Re: [zfs-discuss] ZFS partition info makes my system not boot
Frank Bottone wrote:
> I'm using the latest build of OpenSolaris Express available from
> opensolaris.org. I had no problems with the install (it's an AMD64 x2 3800+,
> 1GB physical RAM, 1 IDE drive for the OS and 4x 250GB SATA drives attached
> to the motherboard - nForce based chipset).
>
> I create a zfs pool on the 4 SATA drives as a raidz and the pool works fine.
> If I reboot with any of the 4 drives connected, the system hangs right after
> all the drives are detected on the POST screen. I need to put them in a
> different system and zero them with dd in order to be able to reconnect them
> to my server and still have the system boot properly.
>
> Any ideas on how I can get around this? It seems like the onboard system
> itself is getting confused by the metadata ZFS is adding to the drive. The
> system already has the latest available BIOS from the manufacturer - I'm not
> using any hardware raid of any sort.

This is likely the BIOS getting confused by the EFI label on the disks. Since
there's no newer BIOS available, there are two ways around this problem:

1) put a normal label on the disk and give zfs slice 2, or
2) don't have the BIOS do auto-detect on those drives.

Many BIOSes let you select None for the disk type; this will allow the system
to boot. Solaris has no problem finding the drives even w/o the BIOS's help...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
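Option 1 looks something like this in practice (device names are placeholders;
format -e offers the SMI/EFI choice under its label command):

    # format -e c2t1d0            # label -> select "SMI label", repeat per disk
    # zpool create tank raidz c2t1d0s2 c2t2d0s2 c2t3d0s2 c2t4d0s2

Slice 2 conventionally covers the whole disk, so little space is lost; the
trade-off is that ZFS only enables the drive write cache automatically when
it is given a whole disk rather than a slice.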
Re: [zfs-discuss] zfs 32bits
Brian D. Horn wrote:
> Take a look at CR 6634371. It's worse than you probably thought.

Actually, almost all of the problems noted in that bug are statistics.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] path-name encodings
Marcus Sundman wrote:
> Bart Smaalders [EMAIL PROTECTED] wrote:
>> UTF8 is the answer here. If you care about anything more than simple ascii
>> and you work in more than a single locale/encoding, use UTF8. You may not
>> understand the meaning of a filename, but at least you'll see the same
>> characters as the person who wrote it.
>
> I think you are a bit confused. A) If you meant that _I_ should use UTF-8
> then that alone won't help. Let's say the person who created the file used
> ISO-8859-1 and named it 'häst', i.e., 0x68e47374. If I then use UTF-8 when
> displaying the filename, my program will be faced with the problem of what
> to do with the second byte, 0xe4, which can't be decoded using UTF-8.
> ('häst' is 0x68c3a47374 in UTF-8, in case someone wonders.)

What I mean is very simple: the OS has no way of merging your various
encodings. If I create a directory and have people from around the world
create a file in that directory named after themselves in their own character
sets, what should I see when I invoke

    % ls -l | less

in that directory? If you wish to share filenames across locales, I suggest
you and everyone else writing to that directory use an encoding that will
work across all those locales. The encoding that works well for this on Unix
systems is UTF8, since it leaves '/' and NUL alone.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] Problems replacing a failed drive.
Michael Stalnaker wrote:
> I have a 24 disk SATA array running on Open Solaris Nevada, b78. We had a
> drive fail, and I've replaced the device but can't get the system to
> recognize that I replaced the drive. zpool status -v shows the failed drive:
>
> [EMAIL PROTECTED] ~]$ zpool status -v
>   pool: LogData
>  state: DEGRADED
> status: One or more devices are faulted in response to persistent errors.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the device
>         repaired.
>  scrub: resilver completed with 0 errors on Wed Feb 27 11:51:45 2008
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         LogData      DEGRADED     0     0     0
>           raidz2     DEGRADED     0     0     0
>             c0t12d0  ONLINE       0     0     0
>             c0t5d0   ONLINE       0     0     0
>             c0t0d0   ONLINE       0     0     0
>             c0t4d0   ONLINE       0     0     0
>             c0t8d0   ONLINE       0     0     0
>             c0t16d0  ONLINE       0     0     0
>             c0t20d0  ONLINE       0     0     0
>             c0t1d0   ONLINE       0     0     0
>             c0t9d0   ONLINE       0     0     0
>             c0t13d0  ONLINE       0     0     0
>             c0t17d0  ONLINE       0     0     0
>             c0t20d0  FAULTED      0     0     0  too many errors
>             c0t2d0   ONLINE       0     0     0
>             c0t6d0   ONLINE       0     0     0
>             c0t10d0  ONLINE       0     0     0
>             c0t14d0  ONLINE       0     0     0
>             c0t18d0  ONLINE       0     0     0
>             c0t22d0  ONLINE       0     0     0
>             c0t3d0   ONLINE       0     0     0
>             c0t7d0   ONLINE       0     0     0
>             c0t11d0  ONLINE       0     0     0
>             c0t15d0  ONLINE       0     0     0
>             c0t19d0  ONLINE       0     0     0
>             c0t23d0  ONLINE       0     0     0
>
> errors: No known data errors
>
> I tried doing a zpool clear with no luck:
>
> [EMAIL PROTECTED] ~]# zpool clear LogData c0t20d0
> [EMAIL PROTECTED] ~]# zpool status -v
>   pool: LogData
>  state: DEGRADED
> status: One or more devices are faulted in response to persistent errors.
>         Sufficient replicas exist for the pool to continue functioning in a
>         degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the device
>         repaired.
>  scrub: resilver completed with 0 errors on Wed Feb 27 11:51:45 2008
> config:
>
>         NAME         STATE     READ WRITE CKSUM
>         LogData      DEGRADED     0     0     0
>           raidz2     DEGRADED     0     0     0
>             c0t12d0  ONLINE       0     0     0
>             c0t5d0   ONLINE       0     0     0
>             c0t0d0   ONLINE       0     0     0
>             c0t4d0   ONLINE       0     0     0
>             c0t8d0   ONLINE       0     0     0
>             c0t16d0  ONLINE       0     0     0
>             c0t20d0  ONLINE       0     0     0
>             c0t1d0   ONLINE       0     0     0
>             c0t9d0   ONLINE       0     0     0
>             c0t13d0  ONLINE       0     0     0
>             c0t17d0  ONLINE       0     0     0
>             c0t20d0  FAULTED      0     0     0  too many errors
>             c0t2d0   ONLINE       0     0     0
>             c0t6d0   ONLINE       0     0     0
>             c0t10d0  ONLINE       0     0     0
>             c0t14d0  ONLINE       0     0     0
>             c0t18d0  ONLINE       0     0     0
>             c0t22d0  ONLINE       0     0     0
>             c0t3d0   ONLINE       0     0     0
>             c0t7d0   ONLINE       0     0     0
>
> And I've tried zpool replace:
>
> [EMAIL PROTECTED] ~]# zpool replace -f LogData c0t20d0
> invalid vdev specification
> the following errors must be manually repaired:
> /dev/dsk/c0t20d0s0 is part of active ZFS pool LogData. Please see zpool(1M).
>
> So.. What am I missing here folks? Any help would be appreciated.

Did you pull out the old drive and add a new one in its place hot? What does
cfgadm -al report? Your drives should look like this:

    sata0/0::dsk/c7t0d0      disk      connected    configured   ok
    sata0/1::dsk/c7t1d0      disk      connected    configured   ok
    sata1/0::dsk/c8t0d0      disk      connected    configured   ok
    sata1/1::dsk/c8t1d0      disk      connected    configured   ok

If c0t20d0 isn't configured, use

    # cfgadm -c configure sata1/1::dsk/c0t20d0

before attempting the zpool replace.

hth,

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] path-name encodings
Marcus Sundman wrote:
> I'm unable to find more info about this. E.g., what does "reject file names"
> mean in practice? E.g., if a program tries to create a file using an
> utf8-incompatible filename, what happens? Does the fopen() fail? Would this
> normally be a problem? E.g., do tar and similar programs convert
> utf8-incompatible filenames to utf8 upon extraction if my locale (or
> wherever the fs encoding is taken from) is set to use utf-8? If they don't,
> then what happens with archives containing utf8-incompatible filenames?

Note that the normal ZFS behavior is exactly what you'd expect: you get the
same filenames back that you put in. The trick is that in order to support
such things as casesensitivity=false for CIFS, the OS needs to know what
characters are uppercase vs lowercase, which means it needs to know about
encodings, and reject codepoints which cannot be classified as uppercase vs
lowercase. If you're not running a CIFS server, the defaults will allow you
to create files w/ utf8 names very happily:

    : [EMAIL PROTECTED]; cat test
    Τη γλώσσα μου έδωσαν ελληνική
    : [EMAIL PROTECTED]; cat `cat test`
    this is a test w/ a utf8 filename
    : [EMAIL PROTECTED]; ls -l
    total 10
    -rw-r--r--   1 barts    staff         37 Oct 22 15:45 Makefile
    -rw-r--r--   1 barts    staff          0 Oct 22 15:46 bar
    -rw-r--r--   1 barts    staff          0 Oct 22 15:46 foo
    -rw-r--r--   1 barts    staff         55 Feb 27 19:45 test
    -rw-r--r--   1 barts    staff        301 Feb 27 19:44 test~
    -rw-r--r--   1 barts    staff         34 Feb 27 19:46 Τη γλώσσα μου έδωσαν ελληνική
    : [EMAIL PROTECTED]; df -h .
    Filesystem             size   used  avail capacity  Mounted on
    zfs/home               228G   136G    48G    74%    /export/home/cyber
    : [EMAIL PROTECTED];

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
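The knobs in question are per-dataset properties set at creation time; a
sketch (the dataset name is a placeholder):

    # zfs create -o utf8only=on -o normalization=formD \
          -o casesensitivity=insensitive tank/cifs-share

With utf8only=on, an attempt to create a name that isn't valid UTF-8 fails
with EILSEQ ("Illegal byte sequence"), which is the error the application's
open() or fopen() sees. With the default utf8only=off, any byte string free
of '/' and NUL is accepted, so tar extractions of odd archives just work.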
Re: [zfs-discuss] Possible interest for ZFS encryption
Ian Collins wrote:
> Disk encryption easily defeated, research shows
> http://www.itpro.co.uk/storage/news/170304/disk-encryption-easily-defeated-research-shows.html
>
> Freezing RAM, whatever next?
>
> Ian

Interesting... although not leaving systems suspended to RAM, and zeroing RAM
on shutdown, would seem to be simple-to-implement safeguards. Yes, if someone
steals the laptop while you're using it you've got problems :-)

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
You will contribute more with mercurial than with thunderbird.
Re: [zfs-discuss] Plans for swapping to part of a pool
Lori Alt wrote:
> Darren J Moffat wrote:
>> As part of the ARC inception review for ZFS crypto we were asked to follow
>> up on PSARC/2006/370, which indicates that swap & dump will be done using a
>> means other than a ZVOL. Currently I have the ZFS crypto project allowing
>> for ephemeral keys to support using a ZVOL as a swap device. Since it seems
>> that we won't be swapping on ZVOLs, I need to find out more about how we
>> will be providing swap and dump space in a root pool.
>
> The current plan is to provide what we're calling (for lack of a better
> term; I'm open to suggestions) a "pseudo-zvol". It's preallocated space
> within the pool, logically concatenated by a driver to appear like a disk or
> a slice. It's meant to be a low-overhead way to emulate a slice within a
> pool, so no COW or related zfs features are provided, except for the ability
> to change its size without having to re-partition a disk. A pseudo-zvol will
> support both swap and dump.
>
> It will also be possible to use a slice for swapping, just as is done now
> with ufs roots. But we're hoping that the overhead of a pseudo-zvol will be
> low enough that administrators will take advantage of it to simplify
> installation (it allows a user to dedicate an entire disk to a root pool,
> without having to carve out part of it for swapping.) Eventually, swapping
> on true zvols might be supported (the problems with swapping to zvols are
> considered bugs), but fixing those bugs is a bigger task than we want to
> take on for the zfs-boot project. We decided on pseudo-zvols as a lower-risk
> approach for the time being.
>
>> I suspect that the best answer to encrypted swap is that we do it
>> independently of which filesystem/device is being used as the swap device -
>> i.e. do it inside the VM system.
>
> Treat a pseudo-zvol like you would a slice.

So these new zvol-like things don't support snapshots, etc., right? I take it
they work by allowing overwriting of the data, correct? Are these a "zslice"?

(An aside: for those of us who've been swapping to zvols for some time, can
you describe the failure modes?)

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
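For reference, the zvol-swap practice the aside refers to looks like this
(pool and volume names, and the size, are placeholders):

    # zfs create -V 2g rpool/swapvol
    # swap -a /dev/zvol/dsk/rpool/swapvol
    # swap -l                              # verify the new swap device

Dump is the piece this thread notes wasn't solved yet; swap on a zvol works
today modulo the bugs Lori mentions.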
Re: [zfs-discuss] ZFS raid is very slow???
Orvar Korvar wrote:
> When I copy that file from ZFS to /dev/null I get this output:
>
>     real    0m0.025s
>     user    0m0.002s
>     sys     0m0.007s
>
> which can't be correct. Is it wrong of me to use "time cp fil fil2" when
> measuring disk performance?

(replying to just this part of your message for now)

cp opens the source file, mmaps it, opens the target file, and does a single
write of the entire file contents. /dev/null's write routine doesn't actually
do a copy into the kernel; it just returns success. As a result, the source
file is never paged into memory.

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
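To actually measure read throughput, force the data through a read() loop
instead of the mmap-to-null shortcut (filename and block size are arbitrary):

    % /bin/ptime dd if=fil of=/dev/null bs=1024k

Note that on a second run the file may come from the ARC rather than the
disks, so use a freshly written file, or one larger than RAM, to measure the
disks themselves.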
Re: [zfs-discuss] Slow write speed to ZFS pool (via NFS)
Joe S wrote:
> I have a couple of performance questions. Right now, I am transferring about
> 200GB of data via NFS to my new Solaris server. I started this YESTERDAY.
> When writing to my ZFS pool via NFS, I notice what I believe to be slow
> write speeds. My client hosts vary between a MacBook Pro running Tiger and a
> FreeBSD 6.2 Intel server. All clients are connected to a 10/100/1000 switch.
>
> * Is there anything I can tune on my server?
> * Is the problem with NFS?
> * Do I need to provide any other information?

If you have a lot of small files, doing this sort of thing over NFS can be
pretty painful... for a speedup, consider:

    (cd <oldroot on client>; tar cf - .) | \
        ssh [EMAIL PROTECTED] '(cd <newroot on server>; tar xf -)'

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Z-Raid performance with Random reads/writes
michael T sedwick wrote:
> Given a 1.6TB ZFS Z-Raid consisting of 6 disks, and a system that does an
> extreme amount of small (20K) random reads (more than twice as many reads
> as writes):
>
> 1) What performance gains, if any, does Z-Raid offer over other RAID or
>    large filesystem configurations?
> 2) What hindrance is Z-Raid to this configuration, given the complete
>    randomness and size of these accesses?
>
> Would there be a better means of configuring a ZFS environment for this type
> of activity?
>
> thanks;

A 6 disk raidz set is not optimal for random reads, since each disk in the
raidz set needs to be accessed to retrieve each item. Note that if the reads
are single threaded, this doesn't apply. However, if multiple reads are
extant at the same time, configuring the disks as 2 sets of 3-disk raidz
vdevs or 3 pairs of mirrored disks will result in 2x and 3x (approx) total
parallel random read throughput.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
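Expressed as pool layouts, the two alternatives above are (device names are
placeholders):

    # zpool create tank raidz c0d0 c1d0 c2d0 raidz c3d0 c4d0 c5d0
    # zpool create tank mirror c0d0 c1d0 mirror c2d0 c3d0 mirror c4d0 c5d0

ZFS stripes across top-level vdevs, and each top-level vdev can service an
independent random read; hence the approximate 2x and 3x figures.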
Re: [zfs-discuss] Z-Raid performance with Random reads/writes
Ian Collins wrote:
> Bart Smaalders wrote:
>> michael T sedwick wrote:
>>> Given a 1.6TB ZFS Z-Raid consisting of 6 disks, and a system that does an
>>> extreme amount of small (20K) random reads (more than twice as many reads
>>> as writes): 1) What performance gains, if any, does Z-Raid offer over
>>> other RAID or large filesystem configurations? 2) What hindrance is Z-Raid
>>> to this configuration, given the complete randomness and size of these
>>> accesses? Would there be a better means of configuring a ZFS environment
>>> for this type of activity?
>>
>> A 6 disk raidz set is not optimal for random reads, since each disk in the
>> raidz set needs to be accessed to retrieve each item. Note that if the
>> reads are single threaded, this doesn't apply. However, if multiple reads
>> are extant at the same time, configuring the disks as 2 sets of 3-disk
>> raidz vdevs or 3 pairs of mirrored disks will result in 2x and 3x (approx)
>> total parallel random read throughput.
>
> I'm not sure why, but when I was testing various configurations with
> bonnie++, 3 pairs of mirrors did give about 3x the random read performance
> of a 6 disk raidz, but with 4 pairs the random read performance dropped
> by 50%:
>
>     3x2    Block read: 220464    Random read: 1520.1
>     4x2    Block read: 295747    Random read:  765.3
>
> Ian

Interesting... I wonder if the blocks being read were striped across two
mirror pairs; this would result in having to read from 2 sets of mirror
pairs, which would produce the reported results...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Best use of 4 drives?
Ian Collins wrote:
> Rick Mann wrote:
>> BTW, I don't mind if the boot drive fails, because it will be fairly easy
>> to replace, and this server is only mission-critical to me and my friends.
>> So... suggestions? What's a good way to utilize the power and glory of ZFS
>> in a 4x 500 GB system, without unnecessary waste?
>
> Bung in a small boot drive (add a USB one if you don't have space) and use
> all the others for ZFS.
>
> Ian

This is how I run my home server w/ 4 500GB drives - a small 40GB IDE drive
provides root and the swap/dump device; the 4 500GB drives are a RAIDZ
containing all the data. I ran out of drive bays, so I used one of those
5 1/4" to 3.5" adaptor brackets to hang the boot drive where a second DVD
drive would go...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: Best use of 4 drives?
Ian Collins wrote:
> Rick Mann wrote:
>> Ian Collins wrote:
>>> Bung in (add a USB one if you don't have space) a small boot drive and use
>>> all the others for ZFS.
>>
>> Not a bad idea; I'll have to see where I can put one. But I thought I read
>> somewhere that one can't use ZFS for swap. Or maybe I read this:
>
> I wouldn't bother; just spec the machine with enough RAM so swap's only real
> use is as a dump device. You can always use a swap file if you have to.
>
> Ian

If you compile stuff (like OpenSolaris), you'll want swap space. Esp. if you
use dmake; 30 parallel C++ compilations can use up a lot of RAM.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Thoughts on CF/SSDs [was: ZFS - Use h/w raid or not?Thoughts.Considerations.]
Frank Cusack wrote:
> On May 31, 2007 1:59:04 PM -0700 Richard Elling [EMAIL PROTECTED] wrote:
>> CF cards aren't generally very fast, so the solid state disk vendors are
>> putting them into hard disk form factors with SAS/SATA interfaces. These
>
> If CF cards aren't fast, how will putting them into a different form factor
> make them faster?

Well, if I were doing that I'd use DRAM and provide enough on-board
capacitance and a small processor to copy the contents of the DRAM to flash
on power failure.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: How does ZFS write data to disks?
Bill Moloney wrote:
> for example, doing sequential 1MB writes to a previously written zvol
> (simple catenation of 5 FC drives in a JBOD), writing 2GB of data induced
> more than 4GB of IO to the drives (with smaller write sizes this ratio gets
> progressively worse)

How did you measure this? This would imply that rewriting a zvol would be
limited to below 50% of disk bandwidth, not something I'm seeing at all.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
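One way to check such a ratio is to compare the bytes the application issued
against what the pool actually wrote (pool/volume names and sizes here are
placeholders):

    # zpool iostat tank 5 &
    # dd if=/dev/zero of=/dev/zvol/rdsk/tank/vol bs=1024k count=2048

Summing zpool iostat's write-bandwidth column over the run and dividing by
the 2 GB issued gives the inflation factor; iostat -xn on the member disks
answers the same question from the device side.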
Re: [zfs-discuss] Re: Lots of overhead with ZFS - what am I doing wrong?
Marko Milisavljevic wrote:
> I missed an important conclusion from j's data, and that is that single disk
> raw access gives him 56MB/s, and the RAID 0 array gives him 961/46=21MB/s
> per disk, which comes in at 38% of potential performance. That is in the
> ballpark of the 45% of potential performance I am seeing with my puny setup
> of single or dual drives. Of course, I don't expect a complex file system to
> match raw disk dd performance, but it doesn't compare favourably to common
> file systems like UFS or ext3, so the question remains: is ZFS overhead
> normally this big? That would mean one needs at least a 4-5 way stripe to
> generate enough data to saturate gigabit ethernet, compared to a 2-3 way
> stripe on a lesser filesystem - a possibly important consideration in a
> SOHO situation.

I don't see this on my system, but it has more CPU (dual core 2.6 GHz). It
saturates a GbE net w/ 4 drives & Samba, not working hard at all. A Thumper
does 2 GB/sec w/ 2 dual-core CPUs. Do you have compression enabled? This can
be a choke point for weak CPUs.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: Re: gzip compression throttles system?
Adam Leventhal wrote:
> On Wed, May 09, 2007 at 11:52:06AM +0100, Darren J Moffat wrote:
>> Can you give some more info on what these problems are?
>
> I was thinking of this bug:
>
>   6460622 zio_nowait() doesn't live up to its name
>
> which I was surprised to find was fixed by Eric in build 59.
>
> Adam

It was pointed out by Jürgen Keil that using ZFS compression submits a lot of
prio 60 tasks to the system task queues; this would clobber interactive
performance.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts

---BeginMessage---

With recent bits ZFS compression is now handled concurrently, with many CPUs
working on different records. So this load will burn more CPUs and achieve
its results (compression) faster. The observed pauses should therefore be
consistent with a load generating high system time. The assumption is that
compression now goes faster than when it was single threaded. Is this
undesirable? We might seek a way to slow down compression in order to limit
the system load.

According to this dtrace script

    #!/usr/sbin/dtrace -s

    sdt:genunix::taskq-enqueue
    /((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)`zio_write_compress/
    {
            @where[stack()] = count();
    }

    tick-5s
    {
            printa(@where);
            trunc(@where);
    }

... I see bursts of ~ 1000 zio_write_compress() [gzip] taskq calls enqueued
into the spa_zio_issue taskq by zfs`spa_sync() and its children:

      0  76337                         :tick-5s
    ...
                  zfs`zio_next_stage+0xa1
                  zfs`zio_wait_for_children+0x5d
                  zfs`zio_wait_children_ready+0x20
                  zfs`zio_next_stage_async+0xbb
                  zfs`zio_nowait+0x11
                  zfs`dbuf_sync_leaf+0x1b3
                  zfs`dbuf_sync_list+0x51
                  zfs`dbuf_sync_indirect+0xcd
                  zfs`dbuf_sync_list+0x5e
                  zfs`dbuf_sync_indirect+0xcd
                  zfs`dbuf_sync_list+0x5e
                  zfs`dnode_sync+0x214
                  zfs`dmu_objset_sync_dnodes+0x55
                  zfs`dmu_objset_sync+0x13d
                  zfs`dsl_dataset_sync+0x42
                  zfs`dsl_pool_sync+0xb5
                  zfs`spa_sync+0x1c5
                  zfs`txg_sync_thread+0x19a
                  unix`thread_start+0x8
                 1092

      0  76337                         :tick-5s

It seems that after such a batch of compress requests is submitted to the
spa_zio_issue taskq, the kernel is busy for several seconds working on these
taskq entries. It seems that this blocks all other taskq activity inside the
kernel...

This dtrace script counts the number of zio_write_compress() calls enqueued /
execed by the kernel per second:

    #!/usr/sbin/dtrace -qs

    sdt:genunix::taskq-enqueue
    /((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)`zio_write_compress/
    {
            this->tqe = (taskq_ent_t *)arg1;
            @enq[this->tqe->tqent_func] = count();
    }

    sdt:genunix::taskq-exec-end
    /((taskq_ent_t *)arg1)->tqent_func == (task_func_t *)`zio_write_compress/
    {
            this->tqe = (taskq_ent_t *)arg1;
            @exec[this->tqe->tqent_func] = count();
    }

    tick-1s
    {
            /* printf("%Y\n", walltimestamp); */
            printf("TS(sec): %u\n", timestamp / 1000000000);
            printa("enqueue %a: %@u\n", @enq);
            printa("exec    %a: %@u\n", @exec);
            trunc(@enq);
            trunc(@exec);
    }

I see bursts of zio_write_compress() calls enqueued / execed, and periods of
time where no zio_write_compress() taskq calls are enqueued or execed.

    10# ~jk/src/dtrace/zpool_gzip7.d
    TS(sec): 7829
    TS(sec): 7830
    TS(sec): 7831
    TS(sec): 7832
    TS(sec): 7833
    TS(sec): 7834
    TS(sec): 7835
    enqueue zfs`zio_write_compress: 1330
    exec    zfs`zio_write_compress: 1330
    TS(sec): 7836
    TS(sec): 7837
    TS(sec): 7838
    TS(sec): 7839
    TS(sec): 7840
    TS(sec): 7841
    TS(sec): 7842
    TS(sec): 7843
    TS(sec): 7844
    enqueue zfs`zio_write_compress: 1116
    exec    zfs`zio_write_compress: 1116
    TS(sec): 7845
    TS(sec): 7846
    TS(sec): 7847
    TS(sec): 7848
    TS(sec): 7849
    TS(sec): 7850
    TS(sec): 7851
    TS(sec): 7852
    TS(sec): 7853
    TS(sec): 7854
    TS(sec): 7855
    TS(sec): 7856
    TS(sec): 7857
    enqueue zfs`zio_write_compress: 932
    exec    zfs`zio_write_compress: 932
    TS(sec): 7858
    TS(sec): 7859
    TS(sec): 7860
    TS(sec): 7861
    TS(sec): 7862
    TS(sec): 7863
    TS(sec): 7864
    TS(sec): 7865
    TS(sec): 7866
    TS(sec): 7867
    enqueue zfs`zio_write_compress: 5
    exec    zfs`zio_write_compress: 5
    TS(sec): 7868
    enqueue zfs`zio_write_compress: 774
    exec    zfs`zio_write_compress: 774
    TS(sec): 7869
    TS(sec): 7870
    TS(sec): 7871
    TS(sec): 7872
    TS(sec): 7873
    TS(sec): 7874
    TS(sec): 7875
    TS(sec): 7876
    enqueue zfs`zio_write_compress: 653
    exec    zfs`zio_write_compress: 653
    TS(sec): 7877
    TS(sec): 7878
    TS(sec): 7879
    TS(sec): 7880
    TS(sec): 7881

And a final dtrace script, which monitors scheduler activity while filling a
gzip compressed pool:

    #!/usr/sbin/dtrace -qs

    sched:::off-cpu,
    sched:::on-cpu,
    sched:::remain-cpu,
    sched:::preempt
Re: [zfs-discuss] Force rewriting of all data, to push stripes onto newly added devices?
Mario Goebbels wrote:
> I'm in sort of a scenario where I've added devices to a pool and would now
> like the existing data to be spread across the new drives, to increase the
> performance. Is there a way to do it, like a scrub? Or would I have to have
> all files copy over themselves, or similar hacks?
>
> Thanks, -mg

This requires rewriting the block pointers; it's the same problem as
supporting vdev removal. I would guess that they'll be solved at the same
time.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
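Until then, the usual workaround really is the copy-over hack Mario mentions:
for example, recreating a dataset via send/receive so everything is rewritten
across the current set of vdevs (dataset and snapshot names are placeholders):

    # zfs snapshot tank/data@respread
    # zfs send tank/data@respread | zfs receive tank/data.new
    # zfs rename tank/data tank/data.old
    # zfs rename tank/data.new tank/data

This needs as much free space as the dataset occupies, and misses any writes
made after the snapshot.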
Re: [zfs-discuss] gzip compression throttles system?
Ian Collins wrote:
> I just had a quick play with gzip compression on a filesystem, and the
> result was the machine grinding to a halt while copying some large (.wav)
> files to it from another filesystem in the same pool. The system became very
> unresponsive, taking several seconds to echo keystrokes. The box is a
> maxed-out AMD QuadFX, so it should have plenty of grunt for this. Comments?
>
> Ian

How big were the files, what OS build are you running, and how much memory is
on the system? Were you copying in parallel?

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Hi folks. I'm looking at putting together a 16-disk ZFS array as a server,
> and after reading Richard Elling's writings on the matter, I'm now left
> wondering if it'll have the performance we expect of such a server. Looking
> at his figures, 5x 3-disk RAIDZ sets seems like it *might* be made to do
> what we want (saturate a GigE link), but not without some tuning.
>
> Am I right in my understanding of relling's small, random read model? For
> mirrored configurations, read performance is proportional to the number of
> disks; write performance is proportional to the number of mirror sets. For
> parity configurations, read performance is proportional to the number of
> RAID sets; write performance is roughly the same.
>
> Clearly, there are elements of the model that don't apply to our sustained
> read/writes, so does anyone have any guidance (theoretical or empirical) on
> what we could expect in that arena? I've seen some references to a different
> ZFS mode of operation for sustained and/or contiguous transfers. What should
> I know about them?
>
> Finally, some requirements I have in speccing up this server:
>
> My requirements:
> . Saturate a 1GigE link for sustained reads _and_ writes
>   (long story... let's just imagine uncompressed HD video)
> . Do it cheaply
>
> My strong desires:
> . ZFS for its reliability, redundancy, flexibility, and ease of use
> . Maximise the amount of usable space
>
> My resources:
> . a server with 16x 500GB SATA drives usable for RAID

What you need to know is what part of your workload is random reads; this
will directly determine the number of spindles required. Otherwise, if your
workload is sequential reads or writes, you can pretty much just use an
average value for disk throughput; with your drives and adequate CPU, you'll
have absolutely no problems _melting_ a 1GbE net.

You want to think about how many disk failures you want to handle before
things go south... there's always a tension between reliability, storage, and
performance. Consider 2 striped sets of raidz2 drives - w/ 6+2 drives in each
set, you get 12 drives' worth of streaming IO (read or write). That will be
about 500 MB/sec, rather more than you can get through a 1GbE net. That's the
aggregate bandwidth; you should be able to both sink and source data at
1Gb/sec w/o any difficulties at all.

If you do a lot of random reads, however, that config will behave like 2
disks in terms of IOPs. To do lots of IOPs, you want to be striped across
lots of 2-disk mirror pairs. My guess is that if you're doing video, you're
doing lots of streaming IO (e.g. you may be reading 20 files at once, but
those files are all being read sequentially). If that's the case, ZFS can do
lots of clever prefetching; on the write side, ZFS, due to its COW behavior,
will handle both random and sequential writes pretty much the same way.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
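The suggested layout as a pool-creation sketch (controller/target names are
placeholders):

    # zpool create tank \
          raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0 c1t7d0 \
          raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0 c2t7d0

16 drives, 12 of them data: any two disks per set can fail, usable space is
12 x 500GB (roughly 6TB), and streaming bandwidth scales with the 12 data
spindles.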
Re: [zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Okay, the way you say it, it sounds like a good thing. I misunderstood the
> performance ramifications of COW and ZFS's opportunistic write locations,
> and came up with a much more pessimistic guess that it would approach random
> writes. As it is, I have upper (number of data spindles) and lower (number
> of disk sets) bounds to deal with. I suppose the available caching memory is
> what controls the resilience to the demands of random reads?

W/ that many drives (16), if you hit in RAM the reads are not really random
:-), or they span only a tiny fraction of the available disk space. Are you
reading and writing the same file at the same time? Your cache hit rate will
be much better then.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] ZFS performance model for sustained, contiguous writes?
Adam Lindsay wrote:
> Bart Smaalders wrote:
>> Adam Lindsay wrote:
>>> Okay, the way you say it, it sounds like a good thing. I misunderstood the
>>> performance ramifications of COW and ZFS's opportunistic write locations,
>>> and came up with a much more pessimistic guess that it would approach
>>> random writes. As it is, I have upper (number of data spindles) and lower
>>> (number of disk sets) bounds to deal with. I suppose the available caching
>>> memory is what controls the resilience to the demands of random reads?
>>
>> W/ that many drives (16), if you hit in RAM the reads are not really random
>> :-), or they span only a tiny fraction of the available disk space.
>
> Clearly I hadn't thought that comment through. :) I think my mental model
> included imagined bottlenecks elsewhere in the system, but I haven't got to
> discussing those yet.

Hmmm... that _was_ prob. more opaque than necessary. What I meant was that
you've got something on the order of 5TB or better of disk space; assuming
uniformly distributed reads of data and 4 GB of RAM, the odds of hitting in
the cache are essentially zero wrt performance.

>> Are you reading and writing the same file at the same time? Your cache hit
>> rate will be much better then.
>
> Not in the general case. Hmm, but there are some scenarios with multimedia
> caching boxes, so that could be interesting to leverage eventually.
>
> Thanks, adam

You're welcome.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: CAD application not working with zfs
Dirk Jakobsmeier wrote:
> Hello Bart, thanks for your answer. The filesystems on different projects
> are sized between 20 and 400 GB. Those filesystem sizes were no problem on
> the earlier installation (vxfs) and should not be a problem now. I can
> reproduce this error with the 20 GB filesystem. Regards.

Are you using nfsv4 for the mount? Or nfsv3?

Some idea of the failing app's system calls just prior to failure may yield
the answer as to what's causing the problem. These problems are usually
mishandled error conditions...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
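Capturing those system calls is straightforward with truss, which exists on
both Solaris and AIX (the output path and application command here are
placeholders):

    # truss -f -o /tmp/catia.truss <command that starts catia>
    # tail -100 /tmp/catia.truss

The last few calls and their errnos before the crash usually point straight
at the mishandled error condition.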
Re: [zfs-discuss] CAD application not working with zfs
Dirk Jakobsmeier wrote:
> Hello, we use several CAD applications, and with one of those we have
> problems using zfs. OS and hardware are SunOS 5.10 Generic_118855-36 on a
> Fire X4200; the CAD application is Catia V4. There are several configuration
> and data files stored on the server and shared via nfs to Solaris and AIX
> clients. The application crashes on the AIX client unless the server is
> sharing those files from a ufs filesystem. Does anybody have any information
> on this?

What are the sizes of the filesystems being exported? Perhaps the AIX client
cannot cope w/ large filesystems?

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Is there any performance problem with hard links in ZFS?
Viktor Turskyi wrote:
> Is there any performance problem with hard links in ZFS? I have a large
> storage. There will be near 5 hard links for every file. Is it ok for ZFS?
> Maybe some problems with snapshots (every 30 minutes there will be a
> snapshot created)? What about the difference in speed while working with 5
> hardlinks vs 5 different files?
>
> ps: It would be very useful if you give me some links about hardlink
> low-level processing.

On my 2 GHz Opteron w/ 2 mirrored zfs disks:

cyber% cat test.c
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <strings.h>

int
main(int argc, char *argv[])
{
        int i;
        char *filename;
        char buffer[1024];

        if (argc != 2) {
                fprintf(stderr, "usage: %s filename\n", argv[0]);
                exit(1);
        }

        strcpy(buffer, argv[1]);
        filename = buffer + strlen(buffer);     /* point past the base name */

        for (i = 0; i < 50000; i++) {
                sprintf(filename, "_%d", i);    /* foo_0, foo_1, ... */
                if (link("foo", buffer) < 0) {
                        perror("link:");
                        exit(1);
                }
        }
        return (0);
}
cyber% ls
test    test.c
cyber% cc -o test test.c
cyber% mkfile 10k foo
cyber% /bin/ptime ./test foo

real        0.976
user        0.039
sys         0.936
cyber% ls | wc
   50003   50003  538906
cyber% /bin/ptime rm foo_*

real        1.869
user        0.110
sys         1.757
cyber%

So it takes just under 1 second to create 50,000 hardlinks to a file; it
takes just under 2 seconds to delete 'em w/ rm. It would prob. be faster to
use a program to delete them.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: update on zfs boot support
Richard Elling wrote:
> [sorry for the late reply, the original got stuck in the mail]
> clarification below...
>
> Ian Collins wrote:
>> Thanks for the heads up. I'm building a new file server at the moment and
>> I'd like to make sure I can migrate to ZFS boot when it arrives. My current
>> plan is to create a pool on 4 500GB drives and throw in a small boot drive.
>> Will I be able to drop the boot drive and move / over to the pool when ZFS
>> boot ships?
>>
>> Yes, you should be able to, given that you have already had a UFS boot
>> drive running root.
>>
>> Hi, however this raises another concern: during recent discussions
>> regarding the disk layout of a zfs system
>> (http://www.opensolaris.org/jive/thread.jspa?threadID=25679&tstart=0)
>> it was said that currently we'd better give zfs the whole device (rather
>> than slices) and keep swap off zfs devices for better performance. If the
>> above recommendation still holds, we still have to have a swap device
>> somewhere other than the devices managed by zfs. Is this limited by the
>> design or the implementation of zfs?
>
> We've updated the wiki to help clarify this confusion. The consensus best
> practice is to have enough RAM that you don't need to swap. If you need to
> swap, your life will be sad no matter what your disk config is. For those
> systems with limited numbers of disks, you really don't have much choice
> about where swap is located, so keep track of your swap *usage* and adjust
> the system accordingly.
> -- richard

One thing several of us want to do in Nevada is allocate swap space
transparently out of the root pool. Yes, there'd be reservations/allocations,
etc. All we need then is a way to have a dedicated dump device in the same
pool...

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] X2200-M2
Jason J. W. Williams wrote:
> Hi Brian, to my understanding the X2100 M2 and X2200 M2 are basically the
> same board OEM'd from Quanta... except the X2200 M2 has two sockets. As to
> ZFS and their weirdness, it would seem to me that fixing it would be more an
> issue for the SATA/SCSI driver. I may be wrong here.

Actually, what has to happen is that we stop using the SATA chipset in IDE
compat mode and write proper SATA drivers for it... and manage the upgrade
issues, driver name changes, etc.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: Efficiency when reading the same file blocks
Jeff Davis wrote:
>> Given your question, are you about to come back with a case where you are
>> not seeing this?
>
> Actually, the case where I saw the bad behavior was in Linux using the CFQ
> I/O scheduler. When reading the same file sequentially, adding processes
> drastically reduced total disk throughput (single-disk machine). Using the
> Linux anticipatory scheduler worked just fine: no additional I/O costs for
> more processes. That got me worried about the project I'm working on, and I
> wanted to understand ZFS's caching behavior better to prove to myself that
> the problem wouldn't happen under ZFS. Clearly the block will be in cache on
> the second read, but what I'd like to know is whether ZFS will ask the disk
> to do a long, efficient sequential read, or whether it will somehow not
> recognize that the read is sequential because the requests are coming from
> different processes.

ZFS has a pretty clever IO scheduler; it will handle multiple readers of the
same file, readers of different files, etc.; in each case prefetch is done
correctly. It also handles programs that skip blocks... You can see this
pretty simply: for small configs (where a single CPU can saturate all the
drives) the net throughput of the drives doesn't vary significantly whether
one is reading a single file or reading 10 files in parallel.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: ZFS failed Disk Rebuild time on x4500
> I've measured resync on some slow IDE disks (*not* an X4500) at an average
> of 20 MBytes/s. So if you have a 500 GByte drive, that would resync a 100%
> full file system in about 7 hours, versus 11 days for some other systems.

My experience is that a set of 80% full 250 GB drives took a bit less than 2
hours each to replace in a 4x raidz config. The majority of space used was
taken by large files (ISOs, music, and movie files (yes, I have teenagers)),
although there's a large number of small files as well. This makes for a
performance of a bit less than 40 MB/sec during resilvering. The system was
pretty sluggish during this operation, but it only had 1GB of RAM, half of
which firefox wanted :-/. This was build 55 of Nevada.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Implementing fbarrier() on ZFS
Peter Schuller wrote:
> Hello,
>
> Often fsync() is used not because one cares that some piece of data is on
> stable storage, but because one wants to ensure that subsequent I/O
> operations are performed after previous I/O operations are on stable
> storage. In these cases the latency introduced by an fsync() is completely
> unnecessary. An fbarrier() or similar would be extremely useful to get the
> proper semantics while still allowing for better performance than what you
> get with fsync().
>
> My assumption has been that this has not traditionally been implemented for
> reasons of implementation complexity. Given ZFS's copy-on-write
> transactional model, would it not be almost trivial to implement fbarrier()?
> Basically just choose to wrap up the transaction at the point of fbarrier()
> and that's it. Am I missing something?
>
> (I do not actually have a use case for this on ZFS, since my experience with
> ZFS is thus far limited to my home storage server. But I have wished for an
> fbarrier() many many times over the past few years...)

Hmmm... is store ordering what you're looking for? E.g., make sure that in
the case of power failure, all previous writes will be visible after reboot
if any subsequent writes are visible.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts
Re: [zfs-discuss] Re: Re: Cheap ZFS homeserver.
Tom Buskey wrote: Tom Buskey wrote: As a followup, the system I'm trying to use this on is a dual PII 400 with 512MB. Real low budget. Hmm... that's lower than I would have expected. Something is likely wrong. These machines do have very limited memory. How fast can you dd from the raw device to /dev/null? Roughly 230Mb/s. Do you mean ~28MB/sec? Something is definitely bogus. What happens when you do dd from both drives at once? - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
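A sketch of the both-drives-at-once test (the device names are placeholders for your two drives; reading from the raw devices is non-destructive):

    # read the first 1 GB of each drive in parallel and see if throughput halves
    dd if=/dev/rdsk/c0d0p0 of=/dev/null bs=1024k count=1024 &
    dd if=/dev/rdsk/c0d1p0 of=/dev/null bs=1024k count=1024 &
    wait

If the combined rate is no better than a single drive, the bottleneck is the controller or bus rather than ZFS.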
Re: [zfs-discuss] FYI: ZFS on USB sticks (from Germany)
Constantin Gonzalez wrote: Hi Richard, Richard Elling wrote: FYI, here is an interesting blog on using ZFS with a dozen USB drives from Constantin. http://blogs.sun.com/solarium/entry/solaris_zfs_auf_12_usb thank you for spotting it :). We're working on translating the video (hope we get the lip-syncing right...) and will then re-release it in an english version. BTW, we've now hosted the video on YouTube so it can be embedded in the blog. Of course, I'll then write an english version of the blog entry with the tech details. Please hang on for a week or two... :). Best regards, Constantin Brilliant video, guys. I particularly liked the fellow in the background with the hardhat and snow shovel :-). The USB stick machinations were pretty cool, too. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: can I use zfs on just a partition?
Tim Cook wrote: I guess I should clarify what I'm doing. Essentially I'd like to have the / and swap on the first 60GB of the disk. Then use the remaining 100GB as a zfs partition to set up zones on. Obviously the snapshots are extremely useful in such a setup :) Does my plan sound feasible from both a usability and performance standpoint? That's exactly how I'm running my Ferrari laptop. Works like a charm. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
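A minimal sketch of that layout (the slice, pool and dataset names are only examples): let the installer have the first slices for / and swap, then hand the leftover slice to ZFS for the zones:

    # create a pool on the remaining slice and a filesystem for zone roots
    zpool create tank c0t0d0s7
    zfs create tank/zones
    zfs set mountpoint=/zones tank/zones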
Re: [zfs-discuss] Re: External drive enclosures + Sun Server for mass storage
Frank Cusack wrote: yes I am an experienced Solaris admin and know all about devfsadm :-) and the older disks command. It doesn't help in this case. I think it's a BIOS thing. Linux and Windows can't see IDE drives that aren't there at boot time either, and on Solaris the SATA controller runs in some legacy mode so I guess that's why you can't see the newly added drive. Unfortunately all my x2100 hardware is in production and I can't readily retest this to verify. -frank This is exactly the issue; some of the simple SATA controllers are run in PATA compatibility mode. The ide driver doesn't know a thing about hot anything, so we would need a proper SATA driver for these chips. Since they work (with the exception of hot *) it is difficult to prioritize this work above getting some other piece of hardware working under Solaris. In addition, switching drivers/BIOS configs during upgrade is a non-trivial exercise. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
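For what it's worth, on controllers that do run under a native SATA driver (an assumption about other hardware, not about the X2100's legacy-mode ports), a hot-added disk shows up as an attachment point that can be configured by hand; a rough sketch:

    cfgadm -al | grep sata      # list SATA attachment points, if the driver exposes any
    cfgadm -c configure sata0/3 # bring a newly inserted drive online (example ap_id)

With the ide driver in compatibility mode nothing like this appears, which is exactly the limitation described above.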
Re: [zfs-discuss] Re: ZFS direct IO
[EMAIL PROTECTED] wrote: In order to protect the user pages while a DIO is in progress, we want support from the VM that isn't presently implemented. To prevent a page from being accessed by another thread, we have to unmap the TLB/PTE entries and lock the page. There's a cost associated with this, as it may be necessary to cross-call other CPUs. Any thread that accesses the locked pages will block. While it's possible to lock pages in the VM today, there isn't a neat set of interfaces the filesystem can use to maintain the integrity of the user's buffers. Without an experimental prototype to verify the design, it's impossible to say whether the overhead of manipulating the page permissions is more than the cost of bypassing the cache. Note also that for most applications, the size of their IO operations would often not match the current page size of the buffer, causing additional performance and scalability issues. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] X2100 not hotswap
Frank Cusack wrote: It's interesting the topics that come up here, which really have little to do with zfs. I guess it just shows how great zfs is. I mean, you would never have a ufs list that talked about the merits of sata vs sas and what hardware do i buy. Also interesting is that zfs exposes hardware bugs yet I don't think that's what really drives the hardware questions here. Actually, I think it's the easy admin of more than a simple mirror; all of a sudden it's simple to deal with multiple drives, add more later, etc... so connectivity to low end boxes becomes important. Also, of course, SATA is still relatively new and we don't yet have extensive controller support (understatement). - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS on PC Based Hardware for NAS?
Elm, Rob wrote: Hello ZFS Discussion Members, I'm looking for help or advice on a project I'm working on: I have a 939 Gigabyte motherboard, with 4 SATAII ports on the nForce4 chipset, and 4 SATA ports off the SIL3114 controller. I recently purchased 5, 320gig SATAII drives... http://tinyurl.com/yf5z9o I wanted to install Solaris x86 on this machine and turn it into a NAS server. ZFS looks very attractive, but I don't believe it can be used for a boot drive. How would you set up a system like this? I'd boot on IDE for now. In the future, we'll be able to boot from a ZFS mirror, but since most root drives don't get much use, sticking w/ two IDE drives there would probably be fine. There are performance/space/safety tradeoffs to be made. What are your goals wrt these attributes? I can purchase additional SATA or IDE hard drives... For example, I could get 3 more 320gig SATAII drives, and fill all the SATA ports. And hook up an IDE drive as the system boot drive. Sincerely, You may wish to take a look at my latest blog post: http://blogs.sun.com/barts - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
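Once the space/safety tradeoff is settled, a starting point for the NAS layout might look like the sketch below (controller/target numbers and dataset names are made up; a single raidz of the five SATA drives gives four drives' worth of space and survives one failure):

    zpool create tank raidz c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0
    zfs create tank/media
    zfs set sharenfs=on tank/media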
Re: [zfs-discuss] Re: Help understanding some benchmark results
Chris Smith wrote: What build/version of Solaris/ZFS are you using? Solaris 11/06. What block size are you using for writes in bonnie++? I find performance on streaming writes is better w/ larger writes. I'm afraid I don't know what block size bonnie++ uses by default - I'm not specifying one on the commandline. trussing bonnie will tell you... I just built bonnie++; by default it uses 8k. What happens when you run two threads at once? Does write performance improve? If I use raidz, no (overall throughput is actually nearly halved!). If I use RAID0 (just striped disks, no redundancy) it improves (significantly in some cases). Increasing the blocksize will help. You can do that on bonnie++ like this: ./bonnie++ -d /internal/ -s 8g:128k ... Make sure you don't have compression on. Some observations: * This machine only has 32 bit CPUs. Could that be limiting performance ? It will, but it shouldn't be too awful here. You can lower kernelbase to let the kernel have more of the RAM on the machine. You're more likely going to run into problems w/ the front side bus; my experience w/ older Xeons is that one CPU could easily saturate the FSB; using the other would just make things worse. You should not be running into that yet, either, though. Offline one of the CPUs w/ psradm -f 1; reenable w/ psradm -n 1. * A single drive will hit ~60MB/s read and write. Since these are only 7200rpm SATA disks, that's probably all they've got to give. On a good day on the right part of the drive... slowest to fastest sectors can be 2:1 in performance... What can you get with your drives w/ dd to the raw device when not part of a pool? E.g. /bin/ptime dd if=/dev/zero of=/dev/rdsk/... bs=128k count=2 You can also do this test to a file to see what the peak is going to be... What kind of write performance do people get out of those honkin' big x4500s ? ~2GB/sec locally, 1 GB/sec over the network. This requires multiple writing threads; a single CPU just isn't fast enough to write 2GB/sec. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
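Spelled out with dd's count= operand and a realistic amount of data (a sketch only: the device name is a placeholder, and writing to a raw disk device destroys whatever is on it, so use a scratch disk):

    # write 1 GB of zeros to a scratch disk, timed (8192 x 128k = 1 GB)
    /bin/ptime dd if=/dev/zero of=/dev/rdsk/c2t0d0p0 bs=128k count=8192
    # the same test against a file in the pool shows the filesystem-level peak
    /bin/ptime dd if=/dev/zero of=/internal/ddtest bs=128k count=8192

Comparing the two numbers separates the drives/controller from ZFS itself.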
Re: [zfs-discuss] Help understanding some benchmark results
Chris Smith wrote: G'day, all, So, I've decided to migrate my home server from Linux+swRAID+LVM to Solaris+ZFS, because it seems to hold much better promise for data integrity, which is my primary concern. However, naturally, I decided to do some benchmarks in the process, and I don't understand why the results are what they are. I thought I had a reasonable understanding of ZFS, but now I'm not so sure. I've used bonnie++ and a variety of Linux RAID configs below to approximate equivalent ZFS configurations and compare. I do realise they're not exactly the same thing, but it seems to me they're reasonable comparisons and should return at least somewhat similar performance. I also realise bonnie++ is not an especially comprehensive or complex benchmark, but ultimately I don't really care about performance and this was only done out of curiosity. The executive summary is that ZFS write performance appears to be relatively awful (all the time), and its read performance is relatively good most of the time (with striping, mirroring and raidz[2]'s with fewer numbers of disks). Examples:
* 8-disk RAID0 on Linux returns about 190MB/s write and 245MB/sec read, while a ZFS raidz using the same disks returns about 120MB/sec write, but 420MB/sec read.
* 16-disk RAID10 on Linux returns 165MB/sec and 440MB/sec write and read, while a ZFS pool with 8 mirrored disks returns 140MB/sec write and 410MB/sec read.
* 16-disk RAID6 on Linux returns 126MB/sec write, 162MB/sec read, while a 16-disk raidz2 returns 80MB/sec write and 142MB/sec read.
The biggest problem I am having understanding why it is so, is because I was under the impression with ZFS's CoW, etc, that writing (*especially* writes like this, to a raidz array) should be much faster than a regular old-fashioned RAID6 array. I certainly can't complain about the read speed, however - 400-odd MB/sec out of this old beastie is pretty impressive :). Help? Have I missed something obvious or done something silly? (Additionally, from the Linux perspective, why are reads so slow?) What build/version of Solaris/ZFS are you using? What block size are you using for writes in bonnie++? I find performance on streaming writes is better w/ larger writes. What happens when you run two threads at once? Does write performance improve? Does zpool iostat -v 1 report anything interesting during the benchmark? What about iostat -x 1? Is one disk significantly more busy than the others? I have a 4x 500GB disk raidz config w/ a 2.6 GHz dual core at home; this config sustains approx 120 MB/sec on reads and writes on single or multiple streams. I'm running build 55; the box has a SI controller running in PATA compat. mode. One of the challenging aspects of performance work on these sorts of things is separating out drivers vs cpus vs memory bandwidth vs disk behavior vs intrinsic FS behavior. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
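To gather what Bart is asking about while bonnie++ runs (a sketch; the pool name tank is a placeholder), run the monitors in a second window:

    zpool iostat -v tank 1   # per-vdev bandwidth and IOPS, one-second samples
    iostat -xn 1             # per-device %busy and service times

A single disk that is much busier than its peers, or service times far above the rest, usually points at a slow drive or controller port rather than the filesystem.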
Re: [zfs-discuss] use the same zfs filesystem with differnet mountpoint
Fabian Wörner wrote: I'm thinking of having Solaris and Mac OS 10.5 on the same machine and mounting the same filesystem at a different point on each OS. Is/will this be possible, or do I have to use symlinks? Since the mount point is stored in the ZFS pool, you'll need to use legacy mounts to do this. This works fine between different Solaris versions; if the Mac folks didn't change their on-disk format it might just work between OS X and Solaris as well. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
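On the Solaris side, the legacy-mount arrangement looks roughly like this (dataset and path names are examples):

    # stop ZFS from managing the mountpoint itself
    zfs set mountpoint=legacy tank/shared
    # then mount it wherever this OS wants it (or via /etc/vfstab)
    mount -F zfs tank/shared /export/shared

Each OS can then pick its own mount path instead of inheriting the one recorded in the pool.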
Re: [zfs-discuss] Question: ZFS + Block level SHA256 ~= almost free CAS Squishing?
[EMAIL PROTECTED] wrote: I have been looking at zfs source trying to get up to speed on the internals. One thing that interests me about the fs is what appears to be a low hanging fruit for block squishing CAS (Content Addressable Storage). I think that in addition to lzjb compression, squishing blocks that contain the same data would buy a lot of space for administrators working in many common workflows. I am writing to see if I can get some feedback from people that know the code better than I -- are there any gotchas in my logic? Assumptions: SHA256 hash used (Fletcher2/4 have too many collisions, SHA256 is 2^128 if I remember correctly) SHA256 hash is taken on the data portion of the block as it exists on disk. the metadata structure is hashed separately. In the current metadata structure, there is a reserved bit portion to be used in the future. Description of change: Creates: The filesystem goes through its normal process of writing a block, and creating the checksum. Before the step where the metadata tree is pushed, the checksum is checked against a global checksum tree to see if there is any match. If match exists, insert a metadata placeholder for the block, that references the already existing block on disk, increment a number_of_links pointer on the metadata blocks to keep track of the pointers pointing to this block. free up the new block that was written and check-summed to be used in the future. else if no match, update the checksum tree with the new checksum and continue as normal. Deletes: normal process, except verifying that the number_of_links count is lowered and if it is non zero then do not free the block. clean up checksum tree as needed. What this requires: A new flag in metadata that can tag the block as a CAS block. A checksum tree that allows easy fast lookup of checksum keys. a counter in the metadata or hash tree that tracks links back to blocks. Some additions to the userland apps to push the config/enable modes. Does this seem feasible? Are there any blocking points that I am missing or unaware of? I am just posting this for discussion, it seems very interesting to me. Note that you'd actually have to verify that the blocks were the same; you cannot count on the hash function. If you didn't do this, anyone discovering a collision could destroy the colliding blocks/files. Val Henson wrote a paper on this topic; there's a copy here: http://infohost.nmt.edu/~val/review/hash.pdf - Bart Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
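To make the verify step concrete at user level (a sketch only: the file names are placeholders, and digest(1) is the Solaris userland command, not the in-kernel checksum path ZFS would use):

    # even when the SHA256 digests match, compare the actual bytes before sharing a block
    a=`digest -a sha256 block_a.bin`
    b=`digest -a sha256 block_b.bin`
    if [ "$a" = "$b" ] && cmp -s block_a.bin block_b.bin; then
            echo "identical blocks: safe to store one copy"
    fi

The cmp step is the point Bart and the Henson paper are making: the hash narrows the candidates, but only a byte-for-byte comparison makes collapsing two blocks safe.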
Re: [zfs-discuss] Re: Puzzling ZFS behavior with COMPRESS option
Anantha N. Srirama wrote: Quick update, since my original post I've confirmed via DTrace (rwtop script in toolkit) that the application is not generating 150MB/S * compressratio of I/O. What then is causing this much I/O in our system? This message posted from opensolaris.org Are you doing random IO? Appending or overwriting? - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS in a SAN environment
Jason J. W. Williams wrote: Not sure. I don't see an advantage to moving off UFS for boot pools. :-) -J Except of course that snapshots and clones will surely be a nicer way of recovering from adverse administrative events... - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hardware planning for storage server
Jakob Praher wrote: hi all, I'd like to build a solid storage server using zfs and opensolaris. The server more or less should have a NAS role, thus using nfsv4 to export the data to other nodes. ... what would be your reasonable advice? First of all, figure out what you need in terms of capacity and IOPS/sec. This will determine the number of spindles, cpus, network adaptors, etc. Keep in mind, for large sequential reads and large writes you can get a significant fraction of the max throughput of the drives; my 4 x 500 GB RAIDZ configuration does 150 MB/sec pretty consistently. If you're doing small random reads or writes, you'll be much more limited by the number of spindles and the way you configure them. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] raidz DEGRADED state
Krzys wrote: my drive did go bad on me, how do I replace it? I am running Solaris 10 U2 (by the way, I thought U3 would be out in November, will it be out soon? does anyone know?)
[11:35:14] server11: /export/home/me zpool status -x
  pool: mypool2
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        mypool2     DEGRADED     0     0     0
          raidz     DEGRADED     0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0
            c3t4d0  ONLINE       0     0     0
            c3t5d0  ONLINE       0     0     0
            c3t6d0  UNAVAIL      0   679     0  cannot open

errors: No known data errors ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss Shut down the machine, replace the drive, reboot and type: zpool replace mypool2 c3t6d0 On earlier versions of ZFS I found it useful to do this at the login prompt; it seemed fairly memory intensive. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
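For completeness, using the pool and device names above (the status check is just a suggestion for watching the rebuild):

    zpool replace mypool2 c3t6d0
    zpool status -x mypool2    # shows resilver progress until the pool reports healthy again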
Re: [zfs-discuss] Is there inotify for ZFS?
LingBo Tang wrote: Hi all, As with inotify on Linux, is there a similar mechanism in Solaris for ZFS? I think this functionality is helpful for a desktop search engine. I know one engineer at Sun is working on a file event monitor, which will provide some information about file events, but it is not aimed at search because it might have problems while monitoring a large system. The right way to implement a desktop search engine w/ ZFS is an API that would let you cheaply discover all the files modified after an arbitrary file in that filesystem. Note that a filesystem is capable of being modified far faster than an indexing program can process those modifications. As a result, any notification scheme must either block further filesystem changes (unacceptable), provide infinite storage of pending change notification events (difficult in practice), or provide a means for re-discovering what has changed since the last time the changes were examined. Since the latter mechanism is needed anyway to handle initialization or modifications during periods when the search engine isn't running, making finding modified files cheap seems like the easiest and most robust approach. Since ZFS uses COW semantics, it is possible to provide a means to very cheaply discover files that have been modified since another file in the filesystem. The cost of this discovery is, very roughly, on the order of the number of modified files times the average number of files in a directory times the mean directory depth of the modified files. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
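For comparison, the clumsy user-level approximation available today (a sketch; the paths are examples, and unlike the COW-based walk described above it has to visit every file, not just the changed ones):

    # files modified more recently than the marker left by the last index run
    find /export/home -newer /export/home/.last-indexed -type f
    # reset the marker for the next pass
    touch /export/home/.last-indexed

The point of a ZFS-level API would be to get the same answer in time proportional to the number of changed files instead of the number of files that exist.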
Re: [zfs-discuss] Re: Recommended Minimum Hardware for ZFS Fileserver?
Wes Williams wrote: Thanks gents for your replies. I've been using a very large config W2100z and ZFS for a while but didn't know how low you can go for ZFS to shine, though a 64-bit CPU seems to be the minimum performance threshold. Now that Sun's store is [sort of] working again, I can see some X2100's with custom configuration and a very low starting price of only $450 sans CPU, drives, and memory. Great!! If only we could get a basic X2100-ish designed, custom build priced, server from Sun that could also hold 3-5 drives internally, I could see a bunch of those being used as ZFS file servers. This would also be a good price point for small office and home users since the X4100 is certainly overkill in this application, though I wouldn't refuse one offered to me. =) I built my own, using essentially the same mobo (Tyan 2865). The Ultra 20 is slightly different, but not enough to matter. I put it in a case that would hold more drives and a larger power supply, and I've got a nice home server w/ a TB of disk (effective space 750GB). Very simple and easy. Right now I'm still using a single disk for /, since I'm worried about safeguarding data, not making sure I have max availability. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Re: Where is the ZFS configuration data stored?
Sergey wrote: + a little addition to the original question: Imagine that you have a RAID attached to a Solaris server. There's ZFS on the RAID. And someday you lose your server completely (fried motherboard, physical crash, ...). Is there any way to connect the RAID to some other server and restore the ZFS layout (not losing all the data on the RAID)? If the RAID controller is undamaged, just hook it up and go; you can import the ZFS pool on another system seamlessly. If the RAID controller gets damaged, you'll need to follow the manufacturer's documentation to restore your data. JBODs are simple, easy and relatively foolproof when used w/ ZFS. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
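The mechanics of moving a pool to another box are just (a sketch; the pool name is an example):

    # on the old host, if it is still alive
    zpool export tank
    # on the new host; -f is needed if the old host died without exporting
    zpool import -f tank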
Re: [zfs-discuss] Proposal: multiple copies of user data
Torrey McMahon wrote: eric kustarz wrote: I want per pool, per dataset, and per file - where all are done by the filesystem (ZFS), not the application. I was talking about a further enhancement to copies than what Matt is currently proposing - per file copies, but it's more work (one thing being we don't have administrative control over files per se). Now if you could do that and make it something that can be set at install time it would get a lot more interesting. When you install Solaris to that single laptop drive you can select files or even directories that have more than one copy in case of a problem down the road. Actually, this is a perfect use case for setting the copies=2 property after installation. The original binaries are quite replaceable; the customizations and personal files created later on are not. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
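Assuming the copies property lands as proposed, that post-install step would be something like this (pool and dataset names are examples; the setting applies to data written from then on):

    # keep two copies of everything under the home directories going forward
    zfs set copies=2 tank/export/home
    # verify the setting
    zfs get copies tank/export/home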
Re: [zfs-discuss] Problem with ZFS's performance
Josip Gracin wrote: Hello! Could somebody please explain the following bad performance of a machine running ZFS. I have a feeling it has something to with the way ZFS uses memory, because I've checked with ::kmastat and it shows that ZFS uses huge amounts of memory which I think is killing the performance of the machine. This is the test program:

#include <malloc.h>
#include <stdio.h>

int main() {
    char *buf = calloc(51200, 1);
    if (buf == NULL) {
        printf("Allocation failed.\n");
    }
    return 0;
}

I've run the test program on the following two different machines, both under light load: Machine A is AMD64 3000+ (2.0GHz), 1GB RAM running snv_46. Machine B is Pentium 4, 2.4GHz, 512MB RAM running Linux. Execution times on several consecutive runs are:
Machine A
time ./a.out
./a.out  0.49s user 1.39s system 2% cpu 1:03.25 total
./a.out  0.48s user 1.28s system 3% cpu 50.691 total
./a.out  0.48s user 1.27s system 4% cpu 38.225 total
./a.out  0.48s user 1.24s system 5% cpu 30.694 total
./a.out  0.47s user 1.20s system 5% cpu 28.640 total
./a.out  0.47s user 1.23s system 6% cpu 28.210 total
./a.out  0.47s user 1.21s system 6% cpu 27.700 total
./a.out  0.47s user 1.19s system 9% cpu 17.875 total
./a.out  0.46s user 1.15s system 12% cpu 12.784 total
On machine B [the first run took approx. 10 seconds, I forgot to paste it]
./a.out  0.14s user 0.89s system 27% cpu 3.711 total
./a.out  0.13s user 0.87s system 25% cpu 3.926 total
./a.out  0.11s user 0.90s system 29% cpu 3.456 total
./a.out  0.11s user 0.91s system 29% cpu 3.435 total
./a.out  0.10s user 0.91s system 38% cpu 2.597 total
./a.out  0.11s user 0.93s system 35% cpu 2.913 total
___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss There are several things going on here, and part of that may well be the memory utilization of ZFS. Have you tried the same experiment when not using ZFS? Keep in mind that Solaris doesn't always use the most efficient strategies for paging applications... this is something we're actively working on fixing as part of the VM work going on... -Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
[zfs-discuss] Re: Write cache
Jesus Cea wrote: Neil Perrin wrote: I suppose if you know the disk only contains zfs slices then write caching could be manually enabled using format -e -> cache -> write_cache -> enable When will we have write cache control over ATA/SATA drives? :-). A method of controlling write cache independent of drive type, color or flavor is being developed; I'll ping the responsible parties (bcc'd). - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
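For reference, the menu path quoted above looks roughly like this once a disk is selected (a sketch; the exact menus offered depend on the drive and driver):

    # format -e
      (choose the disk from the list)
    format> cache
    cache> write_cache
    write_cache> enable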
Re: [zfs-discuss] zfs sucking down my memory!?
Joseph Mocker wrote: Bart Smaalders wrote: How much swap space is configured on this machine? Zero. Is there any reason I would want to configure any swap space? --joe Well, if you want to allocate 500 MB in /tmp, and your machine has no swap, you need 500M of physical memory or the write _will_ fail. W/ no swap configured, every malloc'd allocation (and the like) in every process has to be backed by RAM, so it is effectively locked there. I just swap on a zvol w/ my ZFS root machine. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
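The zvol swap setup is just (a sketch; the pool name, volume name and size are examples):

    # create a 2 GB volume and add it as swap
    zfs create -V 2g rpool/swapvol
    swap -a /dev/zvol/dsk/rpool/swapvol
    # confirm it is in use
    swap -l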
Re: [zfs-discuss] Proposal: delegated administration
Matthew Ahrens wrote: On Mon, Jul 17, 2006 at 09:44:28AM -0700, Bart Smaalders wrote: Mark Shellenbaum wrote: PERMISSION GRANTING zfs allow -c ability[,ability...] dataset -c Create means that the permission will be granted (Locally) to the creator on any newly-created descendant filesystems. ALLOW EXAMPLE Let's set up a public build machine where engineers in group staff can create ZFS file systems, clones, snapshots and so on, but you want to allow only the creator of the file system to destroy it. # zpool create sandbox disks # chmod 1777 /sandbox # zfs allow -l staff create sandbox # zfs allow -c create,destroy,snapshot,clone,promote,mount sandbox So as administrator what do I need to do to set /export/home up for users to be able to create their own snapshots, create dependent filesystems (but still mounted underneath their /export/home/username)? In other words, is there a way to specify the rights of the owner of a filesystem rather than the individual - eg, delayed evaluation of the owner? I think you're asking for the -c Creator flag. This allows permissions (eg, to take snapshots) to be granted to whoever creates the filesystem. The above example shows how this might be done. --matt Actually, I think I mean owner. I want root to create a new filesystem for a new user under the /export/home filesystem, but then have that user get the right privs via inheritance rather than requiring root to run a set of zfs commands. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] Proposal: delegated administration
Matthew Ahrens wrote: On Mon, Jul 17, 2006 at 10:00:44AM -0700, Bart Smaalders wrote: So as administrator what do I need to do to set /export/home up for users to be able to create their own snapshots, create dependent filesystems (but still mounted underneath their /export/home/usrname)? In other words, is there a way to specify the rights of the owner of a filesystem rather than the individual - eg, delayed evaluation of the owner? I think you're asking for the -c Creator flag. This allows permissions (eg, to take snapshots) to be granted to whoever creates the filesystem. The above example shows how this might be done. --matt Actually, I think I mean owner. I want root to create a new filesystem for a new user under the /export/home filesystem, but then have that user get the right privs via inheritance rather than requiring root to run a set of zfs commands. In that case, how should the system determine who the owner is? We toyed with the idea of figuring out the user based on the last component of the filesystem name, but that seemed too tricky, at least for the first version. FYI, here is how you can do it with an additional zfs command: # zfs create tank/home/barts # zfs allow barts create,snapshot,... tank/home/barts --matt Owner of the top level directory is the owner of the filesystem? - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
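A sketch of wrapping those two steps so adding a user stays a one-liner for root (the script name, pool layout and ability list are examples drawn from the proposal above, not part of it):

    #!/bin/sh
    # usage: newhome <username>
    # creates a home dataset and delegates the common abilities to its owner
    u=$1
    zfs create tank/home/$u
    zfs allow $u create,destroy,snapshot,clone,promote,mount tank/home/$u
    chown $u /tank/home/$u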
Re: [zfs-discuss] ZFS home/JDS interaction issue
Thomas Maier-Komor wrote: Hi, I just upgraded my machine at home to Solaris 10U2. As I already had a ZFS, I wanted to migrate my home directories at once to a ZFS from a local UFS metadisk. Copying and changing the config of the automounter succeeded without any problems. But when I tried to log in to JDS, login succeeded, but JDS did not start and the X session always gets terminated after a couple of seconds. /var/dt/Xerrors says that /dev/fb could not be accessed, although it works without any problem when running from the UFS filesystem. Switching back to my UFS based home resolved this issue. I even tried switching over to ZFS and rebooted the machine to make 100% sure everything is in a sane state (i.e. no gconfd etc.), but the issue persisted and switching back to UFS again resolved this issue. Has anybody else had similar problems? Any idea how to resolve this? TIA, Tom I'm running w/ ZFS mounted home directories both on my home and work machines; my work desktop has ZFS root as well. Are you sure you moved just your home directory? Is the automounter config the same (wrt setuid, etc)? Can you log in as root when ZFS is your home directory? If not, there's something else going on. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] ZFS and Storage
Gregory Shaw wrote: On Tue, 2006-06-27 at 09:09 +1000, Nathan Kroenert wrote: How would ZFS self heal in this case? You're using hardware raid. The hardware raid controller will rebuild the volume in the event of a single drive failure. You'd need to keep on top of it, but that's a given in the case of either hardware or software raid. If you've got requirements for surviving an array failure, the recommended solution in that case is to mirror between volumes on multiple arrays. I've always liked software raid (mirroring) in that case, as no manual intervention is needed in the event of an array failure. Mirroring between discrete arrays is usually reserved for mission-critical applications that cost thousands of dollars per hour in downtime. In other words, it won't. You've spent the disk space, but because you're mirroring in the wrong place (the raid array) all ZFS can do is tell you that your data is gone. With luck, subsequent reads _might_ get the right data, but maybe not. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Re: [zfs-discuss] hard drive write cache
Gregory Shaw wrote: I had a question for the group: In the different ZFS discussions in zfs-discuss, I've seen a recurring theme of disabling write cache on disks. I would think that the performance increase of using write cache would be an advantage, and that write cache should be enabled. Realistically, I can see only one situation where write cache would be an issue. If there is no way to flush the write cache, it would be possible for corruption to occur due to a power loss. There are two failure modes associated with disk write caches: 1) the disk write cache, for performance reasons, doesn't write back data (to diff. blocks) to the platter in the order it was received, so transactional ordering isn't maintained and corruption can occur. 2) writes to different disks can have different caching policies, so transactions to files on different filesystems may not complete correctly during a power failure. ZFS enables the write cache and flushes it when committing transaction groups; this ensures that all of a transaction group appears or does not appear on disk. - Bart -- Bart Smaalders Solaris Kernel Performance [EMAIL PROTECTED] http://blogs.sun.com/barts ___ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss