Re: stupid UFS behaviour on random writes

2013-01-18 Thread Stefan Esser
Am 18.01.2013 00:01, schrieb Rick Macklem:
 Wojciech Puchar wrote:
 create a 10GB file (on a 2GB RAM machine, with some swap used to make sure
 little cache would be available for the filesystem).

 dd if=/dev/zero of=file bs=1m count=10k

 block size is 32KB, fragment size 4k


 now test random read access to it (10 threads)

 randomio test 10 0 0 4096

 normal result on such a not-so-fast disk in my laptop.

 118.5 | 118.5 5.8 82.3 383.2 85.6 | 0.0 inf nan 0.0 nan
 138.4 | 138.4 3.9 72.2 499.7 76.1 | 0.0 inf nan 0.0 nan
 142.9 | 142.9 5.4 69.9 297.7 60.9 | 0.0 inf nan 0.0 nan
 133.9 | 133.9 4.3 74.1 480.1 75.1 | 0.0 inf nan 0.0 nan
 138.4 | 138.4 5.1 72.1 380.0 71.3 | 0.0 inf nan 0.0 nan
 145.9 | 145.9 4.7 68.8 419.3 69.6 | 0.0 inf nan 0.0 nan


 systat shows 4kB I/O size. all is fine.

 BUT random 4kB writes

 randomio test 10 1 0 4096

 total | read: latency (ms) | write: latency (ms)
 iops | iops min avg max sdev | iops min avg max sdev
 +---+--
 38.5 | 0.0 inf nan 0.0 nan | 38.5 9.0 166.5 1156.8 261.5
 44.0 | 0.0 inf nan 0.0 nan | 44.0 0.1 251.2 2616.7 492.7
 44.0 | 0.0 inf nan 0.0 nan | 44.0 7.6 178.3 1895.4 330.0
 45.0 | 0.0 inf nan 0.0 nan | 45.0 0.0 239.8 3457.4 522.3
 45.5 | 0.0 inf nan 0.0 nan | 45.5 0.1 249.8 5126.7 621.0



 results are horrific. systat shows 32kB I/O, gstat shows half are
 reads, half are writes.

 Why does UFS need to read the full block, change one 4kB part and then write
 it back, instead of just writing the 4kB part?
 
 Because that's the way the buffer cache works. It writes an entire buffer
 cache block (unless at the end of file), so it must read the rest of the
 block into the buffer, so it doesn't write garbage (the rest of the block) out.

Without having looked at the code or testing:

I assume using O_DIRECT when opening the file should help for that
particular test (on kernels compiled with options DIRECTIO).
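For reference, the change to the test program would be small; a minimal
userland sketch of one aligned 4 kB write with O_DIRECT (file name, offset
and error handling are arbitrary; whether this actually avoids the
read-modify-write on the write path is exactly the untested assumption above):

#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define IOSIZE 4096

int
main(void)
{
        void *buf;
        int fd;

        /* O_DIRECT asks to bypass the buffer cache; UFS only honors it
         * when the kernel was built with options DIRECTIO. */
        fd = open("file", O_RDWR | O_DIRECT);
        if (fd == -1)
                err(1, "open");

        /* Direct I/O generally wants an aligned buffer, offset and size. */
        if (posix_memalign(&buf, IOSIZE, IOSIZE) != 0)
                errx(1, "posix_memalign");
        memset(buf, 'x', IOSIZE);

        /* One 4 kB write at a 4 kB-aligned offset. */
        if (pwrite(fd, buf, IOSIZE, (off_t)8 * IOSIZE) != IOSIZE)
                err(1, "pwrite");

        free(buf);
        close(fd);
        return (0);
}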

 I'd argue that using an I/O size smaller than the file system block size is
 simply sub-optimal and that most apps. don't do random I/O of blocks.
 OR
 If you had an app. that does random I/O of 4K blocks (at 4K byte offsets),
 then using a 4K/1K file system would be better.

A 4k/1k file system has higher overhead (more indirect blocks) and
is clearly sub-optimal for most general uses, today.
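For reference, such a 4k/1k layout is what a newfs invocation along these
lines would produce (-b/-f as documented in newfs(8); the device is of course
whatever is being prepared for the test):

newfs -b 4096 -f 1024 /dev/<device>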

 NFS is the exception, in that it keeps track of a dirty byte range within
 a buffer cache block and writes that byte range. (NFS writes are byte 
 granular,
 unlike a disk.)

It should be easy to add support for a fragment mask to the buffer
cache, which would allow identifying valid fragments. Such a mask should
be set to 0xff for all current uses of the buffer cache (meaning the
full block is valid), but a special case could then be added for writes
of exactly one or multiple fragments, where only the corresponding
valid flag bits were set. In addition, a possible later read from
disk must obviously skip fragments for which the valid mask bits are
already set.
This bit mask could then be used to update the affected fragments
only, without a read-modify-write of the containing block.
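Purely to illustrate the idea (this is not existing FreeBSD code; the names
are invented and a real implementation would live in the buffer/VM layer),
the per-block bookkeeping could look roughly like this:

#include <stdint.h>

#define FRAGS_PER_BLOCK 8               /* e.g. 32 kB block / 4 kB fragment */

struct frag_state {
        uint8_t valid;                  /* fragment already holds good data */
        uint8_t dirty;                  /* fragment modified, must be written */
};

/* A fragment written by the application becomes both valid (no need to
 * read it from disk) and dirty (it must eventually go to disk). */
static inline void
frag_write(struct frag_state *fs, int frag)
{
        fs->valid |= 1 << frag;
        fs->dirty |= 1 << frag;
}

/* A later read of the block only needs disk I/O for fragments whose valid
 * bit is still clear; this returns the mask of fragments to fetch. */
static inline uint8_t
frag_read_mask(const struct frag_state *fs)
{
        return ((uint8_t)~fs->valid);
}

/* Existing users of the buffer cache mark the whole block valid (0xff),
 * which degenerates to today's behaviour. */
static inline void
frag_all_valid(struct frag_state *fs)
{
        fs->valid = 0xff;
}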

But I doubt that such a change would improve performance in the
general case, just in random update scenarios (which might still
be relevant, in case of a DBMS knowing the fragment size and using
it for DB files).

Regards, Stefan


Re: stupid UFS behaviour on random writes

2013-01-18 Thread Wojciech Puchar


But I doubt that such a change would improve performance in the


You doubt it, but I am sure it would improve things a lot. Just imagine multiple
VM images on a filesystem, each running windoze with a 4kB cluster size, each writing something.


No matter what is written from within the VM, it ends up as a read followed by
a write, unless the blocks are already in the buffer cache.
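To put rough numbers on the 32 kB block / 4 kB fragment case from this thread:
every 4 kB logical write that misses the buffer cache turns into a 32 kB read
plus a 32 kB write, i.e. 64 kB moved and two dependent disk operations where
4 kB and one operation would do. That is a 16x inflation in bytes transferred,
and the dependent read-then-write pair is consistent with the drop from roughly
138 read IOPS to roughly 44 write IOPS in the measurements above.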




[GIANT-LOCKED] even without D_NEEDGIANT

2013-01-18 Thread Jimmy Olgeni


Hello list,

At $DAILY_JOB I got involved with an ASI board that didn't have any 
kind of FreeBSD support, so I ended up writing a driver for it.


If you try to ignore the blatant style(9) violations (of which there 
are many, hopefully on the way to being cleaned up), it seems to work fine.


However, I noticed that when loading the driver I always get a
message about the giant lock being used, even if D_NEEDGIANT is not
specified anywhere.

The actual output when loading is this (FreeBSD 9-STABLE i386):

dektec0: DekTec DTA-145 mem 0xfeaff800-0xfeaf irq 16 at device 13.0 on 
pci0
dektec0: [GIANT-LOCKED]
dektec0: [ITHREAD]
dektec0: board model 145, firmware version 2 (tx: 0, rx: 2), tx fifo 16384 MB

Source code here:

  https://github.com/olgeni/freebsd-dektec/blob/master/dektec.c

Can anybody offer a clue about what could be triggering the GIANT
requirement? Could I be doing something that has this, and possibly
other, unintended side effects?

--
jimmy


Re: [GIANT-LOCKED] even without D_NEEDGIANT

2013-01-18 Thread Andriy Gapon
on 18/01/2013 13:39 Jimmy Olgeni said the following:
 
 Hello list,
 
 At $DAILY_JOB I got involved with an ASI board that didn't have any kind of
 FreeBSD support, so I ended up writing a driver for it.
 
 If you try to ignore the blatant style(9) violations (of which there are many,
 hopefully on the way to be cleaned up) it seems to work fine.
 
 However, I noticed that when loading the driver I always get a
 message about the giant lock being used, even if D_NEEDGIANT is not
 specified anywhere.
 
 The actual output when loading is this (FreeBSD 9-STABLE i386):
 
 dektec0: DekTec DTA-145 mem 0xfeaff800-0xfeaf irq 16 at device 13.0 on 
 pci0
 dektec0: [GIANT-LOCKED]
 dektec0: [ITHREAD]
 dektec0: board model 145, firmware version 2 (tx: 0, rx: 2), tx fifo 16384 MB
 
 Source code here:
 
   https://github.com/olgeni/freebsd-dektec/blob/master/dektec.c
 
 Can anybody offer a clue about what could be triggering the GIANT
 requirement? Could I be doing something that has this, and possibly
 other, unintended side effects?
 

See INTR_MPSAFE in bus_setup_intr(9).
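For what it's worth, the change usually amounts to the flags passed in the
attach path; a rough sketch (the softc layout and handler name here are
hypothetical, the call signature is the one documented in bus_setup_intr(9)):

#include <sys/param.h>
#include <sys/bus.h>

struct dektec_softc {                   /* hypothetical, trimmed-down softc */
        struct resource *irq_res;       /* from bus_alloc_resource_any() */
        void            *intr_cookie;
};

static void
dektec_intr(void *arg)                  /* hypothetical interrupt handler */
{
        struct dektec_softc *sc = arg;

        (void)sc;                       /* acknowledge and service the device here */
}

static int
dektec_setup_intr(device_t dev, struct dektec_softc *sc)
{
        /*
         * INTR_MPSAFE tells the kernel the handler does its own locking;
         * without it the handler runs under Giant and the device is
         * reported as [GIANT-LOCKED].
         */
        return (bus_setup_intr(dev, sc->irq_res,
            INTR_TYPE_MISC | INTR_MPSAFE, NULL, dektec_intr, sc,
            &sc->intr_cookie));
}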

-- 
Andriy Gapon


Re: [GIANT-LOCKED] even without D_NEEDGIANT

2013-01-18 Thread Jimmy Olgeni


On Fri, 18 Jan 2013, Andriy Gapon wrote:


See INTR_MPSAFE in bus_setup_intr(9).


Thanks! It went away. Back to testing...

--
jimmy


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Mark Felder
On Thu, 17 Jan 2013 16:12:17 -0600, Karim Fodil-Lemelin  
fodillemlinka...@gmail.com wrote:


SAS controllers may connect to SATA devices, either directly connected  
using native SATA protocol or through SAS expanders using SATA Tunneled  
Protocol (STP).
 The system is currently put in place using SATA instead of SAS,
although it's using the same interface and backplane connectors, and the
drives (SATA) show as da0 in BSD _but_ with the SATA drive we get *much*
better performance. I am thinking that something fancy in that SAS
drive is not being handled correctly by the FreeBSD driver. I am
planning to revisit the SAS drive issue at a later point (sometime next
week).


Your SATA drives are connected directly, not through an interposer such as the
LSISS9252, correct? If so, this might be the cause of your problems.  
Mixing SAS and SATA drives is known to cause serious performance issues  
for almost every JBOD/controller/expander/what-have-you. Change your  
configuration so there is only one protocol being spoken on the bus (SAS)  
by putting your SATA drives behind interposers which translate SAS to SATA  
just before the disk. This will solve many problems.



kmem_map auto-sizing and size dependencies

2013-01-18 Thread Andre Oppermann

The autotuning work is reaching into many places of the kernel and
while trying to tie up all loose ends I've gotten stuck in the kmem_map
and how it works or what its limitations are.

During startup the VM is initialized and an initial kernel virtual
memory map is setup in kmem_init() covering the entire KVM address
range.  Only the kernel itself is actually allocated within that
map.  A bit later on a number of other submaps are allocated (clean_map,
buffer_map, pager_map, exec_map).  Also in kmeminit() (in kern_malloc.c,
different from kmem_init) the kmem_map is allocated.

The (initial?) size of the kmem_map is determined by some voodoo magic,
a sprinkle of nmbclusters * PAGE_SIZE incrementor and lots of tunables.
However it seems to work out to an effective kmem_map_size of about 58MB
on my 16GB AMD64 dev machine:

vm.kvm_size: 549755809792
vm.kvm_free: 530233421824
vm.kmem_size: 16,594,300,928
vm.kmem_size_min: 0
vm.kmem_size_max: 329,853,485,875
vm.kmem_size_scale: 1
vm.kmem_map_size: 59,518,976
vm.kmem_map_free: 16,534,777,856
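These are ordinary sysctl nodes, so they can also be sampled programmatically
if needed; a minimal userland sketch using sysctlbyname(3), node names exactly
as listed above:

#include <sys/types.h>
#include <sys/sysctl.h>
#include <err.h>
#include <stdio.h>

static unsigned long
get_ul(const char *name)
{
        unsigned long val;
        size_t len = sizeof(val);

        if (sysctlbyname(name, &val, &len, NULL, 0) == -1)
                err(1, "sysctlbyname(%s)", name);
        return (val);
}

int
main(void)
{
        /* Same nodes as quoted above; values are in bytes. */
        printf("kmem_size:     %lu\n", get_ul("vm.kmem_size"));
        printf("kmem_map_size: %lu\n", get_ul("vm.kmem_map_size"));
        printf("kmem_map_free: %lu\n", get_ul("vm.kmem_map_free"));
        return (0);
}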

The kmem_map serves kernel malloc (via UMA), contigmalloc and everything
else that uses UMA for memory allocation.

Mbuf memory too is managed by UMA which obtains the backing kernel memory
from the kmem_map.  The limits of the various mbuf memory types have
been considerably raised recently and may make use of 50-75% of all physically
present memory, or available KVM space, whichever is smaller.

Now my questions/comments are:

 Does the kmem_map automatically extend itself if more memory is requested?

 Should it be set to a larger initial value based on min(physical,KVM) space
 available?

 The use of nmbclusters for the initial kmem_map size calculation isn't
 appropriate anymore due to it being set up later and nmbclusters isn't the
 only relevant mbuf type.  We make significant use of page-sized mbuf
 clusters too.

 The naming and output of the various vm.kmem_* and vm.kvm_* sysctls is
 confusing and not easy to reconcile.  Either we need more detail covering
 more aspects, or less of it.  Plus perhaps sysctl subtrees to better describe the
 hierarchy of the maps.

 Why are separate kmem submaps being used?  Is it to limit memory usage of
 certain subsystems?  Are those limits actually enforced?

--
Andre


Re: stupid UFS behaviour on random writes

2013-01-18 Thread Rick Macklem
Stefan Esser wrote:
 Am 18.01.2013 00:01, schrieb Rick Macklem:
  Wojciech Puchar wrote:
  create 10GB file (on 2GB RAM machine, with some swap used to make
  sure
  little cache would be available for filesystem.
 
  dd if=/dev/zero of=file bs=1m count=10k
 
  block size is 32KB, fragment size 4k
 
 
  now test random read access to it (10 threads)
 
  randomio test 10 0 0 4096
 
  normal result on such not so fast disk in my laptop.
 
  118.5 | 118.5 5.8 82.3 383.2 85.6 | 0.0 inf nan 0.0 nan
  138.4 | 138.4 3.9 72.2 499.7 76.1 | 0.0 inf nan 0.0 nan
  142.9 | 142.9 5.4 69.9 297.7 60.9 | 0.0 inf nan 0.0 nan
  133.9 | 133.9 4.3 74.1 480.1 75.1 | 0.0 inf nan 0.0 nan
  138.4 | 138.4 5.1 72.1 380.0 71.3 | 0.0 inf nan 0.0 nan
  145.9 | 145.9 4.7 68.8 419.3 69.6 | 0.0 inf nan 0.0 nan
 
 
  systat shows 4kB I/O size. all is fine.
 
  BUT random 4kB writes
 
  randomio test 10 1 0 4096
 
  total | read: latency (ms) | write: latency (ms)
  iops | iops min avg max sdev | iops min avg max
  sdev
  +---+--
  38.5 | 0.0 inf nan 0.0 nan | 38.5 9.0 166.5 1156.8 261.5
  44.0 | 0.0 inf nan 0.0 nan | 44.0 0.1 251.2 2616.7 492.7
  44.0 | 0.0 inf nan 0.0 nan | 44.0 7.6 178.3 1895.4 330.0
  45.0 | 0.0 inf nan 0.0 nan | 45.0 0.0 239.8 3457.4 522.3
  45.5 | 0.0 inf nan 0.0 nan | 45.5 0.1 249.8 5126.7 621.0
 
 
 
  results are horrific. systat shows 32kB I/O, gstat shows half are
  reads
  half are writes.
 
  Why does UFS need to read the full block, change one 4kB part and then write
  it back, instead of just writing the 4kB part?
 
  Because that's the way the buffer cache works. It writes an entire
  buffer
  cache block (unless at the end of file), so it must read the rest of
  the block into
  the buffer, so it doesn't write garbage (the rest of the block) out.
 
 Without having looked at the code or testing:
 
 I assume using O_DIRECT when opening the file should help for that
 particular test (on kernels compiled with options DIRECTIO).
 
  I'd argue that using an I/O size smaller than the file system block
  size is
  simply sub-optimal and that most apps. don't do random I/O of
  blocks.
  OR
  If you had an app. that does random I/O of 4K blocks (at 4K byte
  offsets),
  then using a 4K/1K file system would be better.
 
 A 4k/1k file system has higher overhead (more indirect blocks) and
 is clearly sub-optimal for most general uses, today.
 
Yes, but if the sysadmin knows that most of the I/O is random 4K blocks,
that's his specific case, not a general use. Sorry, I didn't mean to
imply that a 4K file system was a good choice, in general.

  NFS is the exception, in that it keeps track of a dirty byte range
  within
  a buffer cache block and writes that byte range. (NFS writes are
  byte granular,
  unlike a disk.)
 
 It should be easy to add support for a fragment mask to the buffer
 cache, which would allow identifying valid fragments. Such a mask should
 be set to 0xff for all current uses of the buffer cache (meaning the
 full block is valid), but a special case could then be added for
 writes
 of exactly one or multiple fragments, where only the corresponding
 valid flag bits were set. In addition, a possible later read from
 disk must obviously skip fragments for which the valid mask bits are
 already set.
 This bit mask could then be used to update the affected fragments
 only, without a read-modify-write of the containing block.
 
 But I doubt that such a change would improve performance in the
 general case, just in random update scenarios (which might still
 be relevant, in case of a DBMS knowing the fragment size and using
 it for DB files).
 
 Regards, Stefan
Yes. And for some I/O patterns the fragment change would degrade performance.
You mentioned that a later read might have to skip fragments with the valid bit
set. I think this would translate to doing multiple reads for the other
fragments, in practice. Also, when an app. goes to write a partial fragment,
that fragment would have to be read in, and this could result in several reads
of fragments instead of one read for the entire block. It's the old "the OS
doesn't have a crystal ball that predicts future I/O activity" problem.

Btw, although I did a dirty byte range for NFS in the buffer cache ages ago
(late 1980s), it is also a performance hit for certain cases. The
linker/loaders love to write random-sized chunks to files. For the NFS code,
if the new write isn't contiguous with the old one, a synchronous write of the
old dirty byte range is forced to the server. I have a patch that replaces the
single byte range with a list in order to avoid this synchronous write, but it
has not made it into head. (I hope to do so someday, after more testing and
when I figure out all the implications of

Re: kmem_map auto-sizing and size dependencies

2013-01-18 Thread Alan Cox
I'll follow up with detailed answers to your questions over the weekend.
For now, I will, however, point out that you've misinterpreted the
tunables.  In fact, they say that your kmem map can hold up to 16GB and the
current used space is about 58MB.  Like other things, the kmem map is
auto-sized based on the available physical memory and capped so as not to
consume too much of the overall kernel address space.
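For completeness, in case a different cap is ever wanted: the auto-sizing can
be overridden with loader tunables of the same names, e.g. a line like the
following in /boot/loader.conf (the 24G figure is an arbitrary example, not a
recommendation):

vm.kmem_size="24G"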

Regards,
Alan

On Fri, Jan 18, 2013 at 9:29 AM, Andre Oppermann an...@freebsd.org wrote:

 The autotuning work is reaching into many places of the kernel and
 while trying to tie up all loose ends I've gotten stuck in the kmem_map
 and how it works or what its limitations are.

 During startup the VM is initialized and an initial kernel virtual
 memory map is setup in kmem_init() covering the entire KVM address
 range.  Only the kernel itself is actually allocated within that
 map.  A bit later on a number of other submaps are allocated (clean_map,
 buffer_map, pager_map, exec_map).  Also in kmeminit() (in kern_malloc.c,
 different from kmem_init) the kmem_map is allocated.

 The (initial?) size of the kmem_map is determined by some voodoo magic,
 a sprinkle of nmbclusters * PAGE_SIZE incrementor and lots of tunables.
 However it seems to work out to an effective kmem_map_size of about 58MB
 on my 16GB AMD64 dev machine:

 vm.kvm_size: 549755809792
 vm.kvm_free: 530233421824
 vm.kmem_size: 16,594,300,928
 vm.kmem_size_min: 0
 vm.kmem_size_max: 329,853,485,875
 vm.kmem_size_scale: 1
 vm.kmem_map_size: 59,518,976
 vm.kmem_map_free: 16,534,777,856

 The kmem_map serves kernel malloc (via UMA), contigmalloc and everything
 else that uses UMA for memory allocation.

 Mbuf memory too is managed by UMA which obtains the backing kernel memory
 from the kmem_map.  The limits of the various mbuf memory types have
 been considerably raised recently and may make use of 50-75% of all
 physically
 present memory, or available KVM space, whichever is smaller.

 Now my questions/comments are:

  Does the kmem_map automatically extend itself if more memory is requested?

  Should it be set to a larger initial value based on min(physical,KVM)
 space
  available?

  The use of nmbclusters for the initial kmem_map size calculation isn't
  appropriate anymore due to it being set up later and nmbclusters isn't the
  only relevant mbuf type.  We make significant use of page-sized mbuf
  clusters too.

  The naming and output of the various vm.kmem_* and vm.kvm_* sysctls is
  confusing and not easy to reconcile.  Either we need some more detailing
  more aspects or less.  Plus perhaps sysctl subtrees to better describe the
  hierarchy of the maps.

  Why are separate kmem submaps being used?  Is it to limit memory usage of
  certain subsystems?  Are those limits actually enforced?

 --
 Andre



Re: kmem_map auto-sizing and size dependencies

2013-01-18 Thread mdf
On Fri, Jan 18, 2013 at 7:29 AM, Andre Oppermann an...@freebsd.org wrote:
 The (initial?) size of the kmem_map is determined by some voodoo magic,
 a sprinkle of nmbclusters * PAGE_SIZE incrementor and lots of tunables.
 However it seems to work out to an effective kmem_map_size of about 58MB
 on my 16GB AMD64 dev machine:

 vm.kvm_size: 549755809792
 vm.kvm_free: 530233421824
 vm.kmem_size: 16,594,300,928
 vm.kmem_size_min: 0
 vm.kmem_size_max: 329,853,485,875
 vm.kmem_size_scale: 1
 vm.kmem_map_size: 59,518,976
 vm.kmem_map_free: 16,534,777,856

 The kmem_map serves kernel malloc (via UMA), contigmalloc and everything
 else that uses UMA for memory allocation.

 Mbuf memory too is managed by UMA which obtains the backing kernel memory
 from the kmem_map.  The limits of the various mbuf memory types have
 been considerably raised recently and may make use of 50-75% of all
 physically
 present memory, or available KVM space, whichever is smaller.

 Now my questions/comments are:

  Does the kmem_map automatically extend itself if more memory is requested?

Not that I recall.

  Should it be set to a larger initial value based on min(physical,KVM) space
  available?

It needs to be smaller than the physical space, because the only limit
on the kernel's use of (pinned) memory is the size of the map.  So if
it is too large there is nothing to stop the kernel from consuming all
available memory.  The lowmem handler is called when running out of
virtual space only (i.e. a failure to allocate a range in the map).

  The naming and output of the various vm.kmem_* and vm.kvm_* sysctls is
  confusing and not easy to reconcile.  Either we need some more detailing
  more aspects or less.  Plus perhaps sysctl subtrees to better describe the
  hierarchy of the maps.

  Why are separate kmem submaps being used?  Is it to limit memory usage of
  certain subsystems?  Are those limits actually enforced?

I mostly know about memguard, since I added memguard_fudge().  IIRC
some of the submaps are used.  The memguard_map specifically is used
to know whether an allocation is guarded or not, so at free(9) it can
be handled as normal malloc() or as memguard.

Cheers,
matthew


Re: Fixing grep -D skip

2013-01-18 Thread John Baldwin
On Thursday, January 17, 2013 9:33:53 pm David Xu wrote:
 I am trying to fix a bug in GNU grep; the bug is that if you
 want to skip FIFO files, it will not work, for example:
 
 grep -D skip aaa .
 
 it will get stuck on a FIFO file.
 
 Here is the patch:
 http://people.freebsd.org/~davidxu/patch/grep.c.diff2
 
 Is it fine to be committed ?

I think the first part definitely looks fine.  My guess is the non-blocking
change is also probably fine, but that should be run by the bsdgrep person at
least.

-- 
John Baldwin


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Scott Long
Try adding the following to /boot/loader.conf and reboot:

hw.mpt.enable_sata_wc=1

The default value, -1, instructs the driver to leave the SATA drives at their
configuration default.  Oftentimes this means that the MPT BIOS will turn off
the write cache on every system boot sequence.  IT DOES THIS FOR A GOOD REASON!
An enabled write cache is counter to data reliability.  Yes, it helps make
benchmarks look really good, and it's acceptable if your data can be safely
thrown away (for example, you're just caching from a slower source, and the
cache can be rebuilt if it gets corrupted).  And yes, Linux has many tricks to
make this benchmark look really good.  The tricks range from buffering the raw
device to having 'dd' recognize the requested task and short-circuit the
process of going to /dev/null or pulling from /dev/zero.  I can't tell you how
bogus these tests are and how completely irrelevant they are in predicting
actual workload performance.  But I'm not going to stop anyone from trying, so
give the above tunable a try and let me know how it works.

Btw, I'm not subscribed to the hackers mailing list, so please redistribute 
this email as needed.

Scott





 From: Dieter BSD dieter...@gmail.com
To: freebsd-hackers@freebsd.org 
Cc: mja...@freebsd.org; gi...@freebsd.org; sco...@freebsd.org 
Sent: Thursday, January 17, 2013 9:03 PM
Subject: Re: IBM blade server abysmal disk write performances
 
 I am thinking that something fancy in that SAS drive is
 not being handled correctly by the FreeBSD driver.

I think so too, and I think the "something fancy" is tagged command queuing.
The driver prints "da0: Command Queueing enabled" and yet your SAS drive
is only getting 1 write per rev, and queuing should get you more than that.
Your SATA drive is getting the expected performance, which means that NCQ
must be working.

 Please let me know if there is anything you would like me to run on the
 BSD 9.1 system to help diagnose this issue?

Looking at the mpt driver, a verbose boot may give more info.
Looks like you can set a debug device hint, but I don't
see any documentation on what to set it to.

I think it is time to ask the driver wizards why TCQ isn't working,
so I'm cc-ing the authors listed on the mpt man page.





Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Wojciech Puchar


The default value, -1, instructs the driver to leave the SATA drives at their 
configuration default.  Often times this means that the MPT BIOS will turn off 
the write cache on every system boot sequence.  IT DOES THIS FOR A GOOD REASON! 
 An enabled write cache is counter to data reliability.  Yes, it helps make 
benchmarks look really good, and it's acceptable if your data can be safely 
thrown away (for example, you're just caching from a slower source, and the 
cache can be rebuilt if it gets corrupted).  And yes, Linux has many tricks to 
make this benchmark look really good.  The tricks range from buffering the raw 
device to having 'dd' recognize the requested task and short-circuit the 
process of going to /dev/null or pulling from /dev/zero.  I can't tell you how 
bogus these tests are and how completely irrelevant they are in predicting 
actual workload performance.  But, I'm not going to stop anyone from trying, so 
give the above tunable a try
and let me know how it works.

If the computer has a UPS then write caching is fine. Even if FreeBSD crashes,
the disk would still write its data.

Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Scott Long
- Original Message -

 From: Wojciech Puchar woj...@wojtek.tensor.gdynia.pl
 To: Scott Long scott4l...@yahoo.com
 Cc: Dieter BSD dieter...@gmail.com; freebsd-hackers@freebsd.org 
 freebsd-hackers@freebsd.org; gi...@freebsd.org gi...@freebsd.org; 
 sco...@freebsd.org sco...@freebsd.org; mja...@freebsd.org 
 mja...@freebsd.org
 Sent: Friday, January 18, 2013 11:10 AM
 Subject: Re: IBM blade server abysmal disk write performances
 
 
  The default value, -1, instructs the driver to leave the SATA drives at 
 their configuration default.  Often times this means that the MPT BIOS will 
 turn 
 off the write cache on every system boot sequence.  IT DOES THIS FOR A GOOD 
 REASON!  An enabled write cache is counter to data reliability.  Yes, it 
 helps 
 make benchmarks look really good, and it's acceptable if your data can be 
 safely thrown away (for example, you're just caching from a slower source, 
 and the cache can be rebuilt if it gets corrupted).  And yes, Linux has many 
 tricks to make this benchmark look really good.  The tricks range from 
 buffering 
 the raw device to having 'dd' recognize the requested task and 
 short-circuit the process of going to /dev/null or pulling from /dev/zero.  I 
 can't tell you how bogus these tests are and how completely irrelevant they 
 are in predicting actual workload performance.  But, I'm not going to stop 
 anyone from trying, so give the above tunable a try
  and let me know how it works.
 
  If the computer has a UPS then write caching is fine. Even if FreeBSD crashes,
  the disk would write data
 

I suspect that I'm encountering situations right now at netflix where this 
advice is not true.  I have drives that are seeing intermittent errors, then 
being forced into reset after a timeout, and then coming back up with 
filesystem problems.  It's only a suspicion at this point, not a confirmed case.

Scott



Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Wojciech Puchar

disk would write data



I suspect that I'm encountering situations right now at netflix where this 
advice is not true.  I have drives that are seeing intermittent errors, then 
being forced into reset after a timeout, and then coming back up with 
filesystem problems.  It's only a suspicion at this point, not a confirmed case.

True. I just assumed that anywhere it matters one would use gmirror.
As for myself, I always prefer to use drives from different manufacturers for
gmirror, or at least drives not manufactured at a similar time.


Two failures at the same moment are rather unlikely. Of course, everything is
possible, so I do proper backups to remote sites. Remote means another
city.

Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Dieter BSD
Wojciech writes:
 If the computer has a UPS then write caching is fine. Even if FreeBSD crashes,
 the disk would write data

That is incorrect.  A UPS reduces the risk, but does not eliminate it.
It is impossible to completely eliminate the risk of having the
write cache on.  If you care about your data you must turn the disk's
write cache off.

If you are using the drive in an application where the data does
not matter, or can easily be regenerated (e.g. disk duplication,
if it fails, just start over), then turning the write cache on
for that one drive can be ok. There is a patch that allows turning
the write cache on and off on a per drive basis. The patch is for
ata(4), but should be possible with other drivers.  camcontrol(8)
may work for SCSI and SAS drives. I have yet to see a USB-to-*ATA
bridge that allows turning the write cache off, so USB disks are
useless for most applications.
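For SCSI/SAS that means clearing the WCE bit in the caching control mode page,
along these lines (assuming the drive shows up as da0; see camcontrol(8)):

camcontrol modepage da0 -m 8 -e

and setting WCE to 0 in the editor that comes up.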

But for most applications, you must have the write cache off,
and you need queuing (e.g. TCQ or NCQ) for performance.  If
you have queuing, there is no need to turn the write cache
on.

It is inexcusable that FreeBSD defaults to leaving the write cache on
for SATA & PATA drives.  At least the admin can easily fix this by
adding hw.ata.wc=0 to /boot/loader.conf.  The bigger problem is that
FreeBSD does not support queuing on all controllers that support it.
Not something that admins can fix, and inexcusable for an OS that
claims to care about performance.


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Ian Lepore
On Fri, 2013-01-18 at 20:37 +0100, Wojciech Puchar wrote:
  disk would write data
 
 
  I suspect that I'm encountering situations right now at netflix where this 
  advice is not true.  I have drives that are seeing intermittent errors, 
  then being forced into reset after a timeout, and then coming back up with 
  filesystem problems.  It's only a suspicion at this point, not a confirmed 
  case.
 true. I just assumed that anywhere it matters one would use gmirror.
 As for myself - i always prefer to put different manufacturers drives for 
 gmirror or at least - not manufactured at similar time.
 

That is good advice.  I bought six 1TB drives at the same time a few
years ago and received drives with consecutive serial numbers.  They
were all part of the same array, and they all failed (click of death)
within a six-hour timespan of each other.  Luckily I noticed the
clicking right away and was able to get all the data copied to another
array within a few hours, before they all died.

-- Ian

 2 fails at the same moment is rather unlikely. Of course - everything is 
 possible so i do proper backups to remote sites. Remote means another 
 city.




Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Wojciech Puchar


That is incorrect.  A UPS reduces the risk, but does not eliminate it.


Nothing eliminates all risks.


But for most applications, you must have the write cache off,
and you need queuing (e.g. TCQ or NCQ) for performance.  If
you have queuing, there is no need to turn the write cache
on.
Did you test the above claim? I have SATA drives everywhere, all in AHCI
mode, all with NCQ active.






It is inexcusable that FreeBSD defaults to leaving the write cache on
for SATA & PATA drives.  At least the admin can easily fix this by
adding hw.ata.wc=0 to /boot/loader.conf.  The bigger problem is that
FreeBSD does not support queuing on all controllers that support it.

I must be lucky, as I have never had a case of not seeing
"adaX: Command Queueing enabled"
on my machines.
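(Whether NCQ was actually negotiated, and with how many tags, can also be
double-checked from userland, e.g., assuming the drive is ada0:

camcontrol identify ada0

and looking at the Native Command Queuing / queue depth lines in the output.)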


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Scott Long

On Jan 18, 2013, at 1:12 PM, Dieter BSD dieter...@gmail.com wrote:
 It is inexcusable that FreeBSD defaults to leaving the write cache on
 for SATA & PATA drives.

This was completely driven by the need to satisfy idiotic benchmarkers,
tech writers, and system administrators.  It was a huge deal for FreeBSD
4.4, IIRC.  It had been silently enabled, we turned it off, released 4.4,
and then got murdered in the press for being slow.

If I had my way, the WC would be off, everyone would be using SAS,
and anyone who enabled SATA WC or complained about I/O slowness
would be forced into Siberian salt mines for the remainder of their lives.


  At least the admin can easily fix this by
 adding hw.ata.wc=0 to /boot/loader.conf.  The bigger problem is that
 FreeBSD does not support queuing on all controllers that support it.
 Not something that admins can fix, and inexcusable for an OS that
 claims to care about performance.

You keep saying this, but I'm unclear on what you mean.  Can you
explain?

Scott



Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Wojciech Puchar

and anyone who enabled SATA WC or complained about I/O slowness
would be forced into Siberian salt mines for the remainder of their lives.


so reserve a place for me there.


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Ian Lepore
On Fri, 2013-01-18 at 22:18 +0100, Wojciech Puchar wrote:
  and anyone who enabled SATA WC or complained about I/O slowness
  would be forced into Siberian salt mines for the remainder of their lives.
 
 so reserve a place for me there.

Yeah, me too.  I prefer to go for all-out performance with separate risk
mitigation strategies.  I wouldn't set up a client datacenter that way,
but it's wholly appropriate for what I do with this machine.

-- Ian




Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Peter Jeremy
On 2013-Jan-18 12:12:11 -0800, Dieter BSD dieter...@gmail.com wrote:
adding hw.ata.wc=0 to /boot/loader.conf.  The bigger problem is that
FreeBSD does not support queuing on all controllers that support it.
Not something that admins can fix, and inexcusable for an OS that
claims to care about performance.

Apart from continuous whinging and whining on mailing lists, what have
you done to add support for queuing?

-- 
Peter Jeremy




Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Karim Fodil-Lemelin

On 18/01/2013 10:16 AM, Mark Felder wrote:
On Thu, 17 Jan 2013 16:12:17 -0600, Karim Fodil-Lemelin 
fodillemlinka...@gmail.com wrote:


SAS controllers may connect to SATA devices, either directly 
connected using native SATA protocol or through SAS expanders using 
SATA Tunneled Protocol (STP).
 The systems is currently put in place using SATA instead of SAS 
although its using the same interface and backplane connectors and 
the drives (SATA) show as da0 in BSD _but_ with the SATA drive we get 
*much* better performances. I am thinking that something fancy in 
that SAS drive is not being handled correctly by the FreeBSD driver. 
I am planning to revisit the SAS drive issue at a later point 
(sometimes next week).


Your SATA drives are connected directly not with an interposer such as 
the LSISS9252, correct? If so, this might be the cause of your 
problems. Mixing SAS and SATA drives is known to cause serious 
performance issues for almost every 
JBOD/controller/expander/what-have-you. Change your configuration so 
there is only one protocol being spoken on the bus (SAS) by putting 
your SATA drives behind interposers which translate SAS to SATA just 
before the disk. This will solve many problems.
Not sure what you mean by this but isn't the mpt detecting an interposer 
in this line:


mpt0: LSILogic SAS/SATA Adapter port 0x1000-0x10ff mem 
0x9991-0x99913fff,0x9990-0x9990 irq 28 at device 0.0 on pci11

mpt0: MPI Version=1.5.20.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 0 Active Volumes (2 Max)
mpt0: 0 Hidden Drive Members (14 Max)

Also please note that SATA speed in that same hardware setup works just fine.
In any case I will have a look.


Thanks,

Karim.


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Matthew Jacob
This is all turning into a bikeshed discussion. As far as I can tell, 
the basic original question was why a *SAS* (not a SATA) drive was not 
performing as well as expected based upon experiences with Linux. I 
still don't know whether reads or writes were being used for dd.


This morning, I ran a fio test with a single-threaded read component and
a multithreaded write component to see if there were differences. All I
had connected to my MPT system were ATA drives (Seagate 500GBs), and I'm
remote now and won't be back until Sunday to put in one of my 'good' SAS
drives (140 GB Seagates, i.e., real SAS 15K RPM drives, not fat SATA
bs drives).


The numbers were pretty much the same for both FreeBSD and Linux. In 
fact, FreeBSD was slightly faster. I won't report the exact numbers 
right now, but only mention this as a piece of information that at least 
in my case the differences between the OS platform involved is 
negligible. This would, at least in my case, rule out issues based upon 
different platform access methods and different drivers.


All of this other discussion about WCE and whatnot is nice, but for
all the purposes it serves it could be moved to *-advocacy.




Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Matthew Jacob




mpt0: LSILogic SAS/SATA Adapter port 0x1000-0x10ff mem 
0x9991-0x99913fff,0x9990-0x9990 irq 28 at device 0.0 on pci11

mpt0: MPI Version=1.5.20.0
mpt0: Capabilities: ( RAID-0 RAID-1E RAID-1 )
mpt0: 0 Active Volumes (2 Max)
mpt0: 0 Hidden Drive Members (14 Max)
Ah. Historically IBM systems (the 335, for one) have been very slow with 
the Integrated Raid software, at least on FreeBSD.





Re: Fixing grep -D skip

2013-01-18 Thread Xin Li

On 01/18/13 08:39, John Baldwin wrote:
 On Thursday, January 17, 2013 9:33:53 pm David Xu wrote:
 I am trying to fix a bug in GNU grep, the bug is if you want to 
 skip FIFO file, it will not work, for example:
 
 grep -D skip aaa .
 
 it will be stucked on a FIFO file.
 
 Here is the patch: 
 http://people.freebsd.org/~davidxu/patch/grep.c.diff2
 
 Is it fine to be committed ?
 
 I think the first part definitely looks fine.  My guess is the 
 non-blocking change is als probably fine, but that should be run
 by the bsdgrep person at least.

I (disclaimer: not the bsdgrep person) have just tested that bsdgrep
handles this case just fine.

The non-blocking part is required to make the code function; otherwise
the system will block on open() if the FIFO doesn't have another opener.
I'd say Yes for this p

Cheers,
-- 
Xin LI delp...@delphij.nethttps://www.delphij.net/
FreeBSD - The Power to Serve!   Live free or die


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Dieter BSD
Scott writes:
 If I had my way, the WC would be off, everyone would be using SAS,
 and anyone who enabled SATA WC or complained about I/O slowness
 would be forced into Siberian salt mines for the remainder of their lives.

Actually, if you are running SAS, having the SATA WC on or off wouldn't
matter; it would be SCSI's WC you'd care about.  :-)

 The bigger problem is that
 FreeBSD does not support queuing on all controllers that support it.
 Not something that admins can fix, and inexcusable for an OS that
 claims to care about performance.

 You keep saying this, but I'm unclear on what you mean.  Can you
 explain?

For most applications you need the write cache to be off.
Having the write cache off is fine as long as you have queuing.
But with the write cache off, if you don't have queuing, performance
sucks. Like getting only 6% of the performance you should be getting.
Some of the early SATA controllers didn't have NCQ.  Knowing that
queuing was very important, I made sure to choose a mainboard with
NCQ, giving up other useful features to get it.  But FreeBSD does
not support NCQ on the nforce4-ultra's SATA controllers.  Even that
sad joke of an OS, Linux, has had NCQ on nforce4 since Oct 2006.
But Linux is such crap it is unusable.  Linux is slowly improving,
but I don't expect to live long enough to see it become usable.
Seriously. I've tried it several times but I have completely
given up on it.  Anyway, even after all these years the supposedly
performance-oriented FreeBSD still does not support NCQ on nforce4,
which isn't some obscure chip; they sold a lot of them.  I've added
3 additional SATA controllers on expansion cards, and FreeBSD
supports NCQ on them, so the slow controllers limited by PCIe-x1
have much better write performance than the much faster controllers
in the chipset with all the bandwidth they need.  I can't add
more controllers; there aren't any free slots.  The nforce
will remain in service for years; aside from the monetary cost,
silicon has a huge environmental cost: embedded energy,
water, pollution, etc.  And there are a lot of them.

Wojciech writes:
 That is incorrect.  A UPS reduces the risk, but does not eliminate it.

 nothing eliminate all risks.

Turning the write cache off eliminates the risk of having the write cache
on.  Yes you can still lose data for other reasons.  Backups are still a
good idea.

 But for most applications, you must have the write cache off,
 and you need queuing (e.g. TCQ or NCQ) for performance.  If
 you have queuing, there is no need to turn the write cache
 on.

 Did you test the above claim? I have SATA drives everywhere, all in AHCI
 mode, all with NCQ active.

Yes, turn the write cache off and NCQ will give you the performance.
As long as you have queuing you can have the best of both worlds.

Which is why Karim's problem is so odd.  Driver says there is queuing,
but performance (1 write per rev) looks exactly like there is no queuing.
Maybe there is something else that causes only 1 write per rev but
I don't know what that might be.

Peter writes:
 Apart from continuous whinging and whining on mailing lists, what have
 you done to add support for queuing?

Submitted a PR; it was closed without being fixed.  Looked at the code,
but it was Greek to me, even though I have successfully modified a BSD-based
device driver in the past, giving a major performance improvement.  If I were
a C-level exec of a Fortune 500 company I'd just hire some device driver
wizard.


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Karim Fodil-Lemelin

On 18/01/2013 5:42 PM, Matthew Jacob wrote:
This is all turning into a bikeshed discussion. As far as I can tell, 
the basic original question was why a *SAS* (not a SATA) drive was not 
performing as well as expected based upon experiences with Linux. I 
still don't know whether reads or writes were being used for dd.


This morning, I ran a fio test with a single threaded read component 
and a multithreaded write component to see if there were differences. 
All I had connected to my MPT system were ATA drives (Seagate 500GBs) 
and I'm remote now and won't be back until Sunday to put one of my 
'good' SAS drives (140 GB Seagates, i.e., real SAS 15K RPM drives, not 
fat SATA bs drives).


The numbers were pretty much the same for both FreeBSD and Linux. In 
fact, FreeBSD was slightly faster. I won't report the exact numbers 
right now, but only mention this as a piece of information that at 
least in my case the differences between the OS platform involved is 
negligible. This would, at least in my case, rule out issues based 
upon different platform access methods and different drivers.


All of this other discussion, about WCE and what not is nice, but for 
all intents and purposes it serves could be moved to *-advocacy.



Thanks for the clarifications!

I did mention at some point that those were write speeds and that reads were
just fine, and that the writes were either to the filesystem or direct to the
device (only on SAS again).


Here is what I am planning to do next week when I get the chance:

0) I plan on focusing on the SAS driver tests _only_ since SATA is 
working as expected so nothing to report there.
1) Look carefully at how the drives are physically connected. Although
it feels like if SATA works fine then SAS should too, I'll check
anyway.
2) Boot verbose with boot -v and send the dmesg output. mpt driver 
might give us a clue.
3) Run gstat -abc in a loop for the test duration. Although I would 
think ctlstat(8) might be more interesting here so I'll run it too for 
good measure :).


Please note that in all tests write caching was enabled as I think this 
is the default with FBSD 9.1 GENERIC but I'll confirm this with 
camcontrol(8).


I've also seen quite a lot of 'quirks' for tagged command queuing in the
source code (/sys/cam/scsi/scsi_xpt.c) but a particular one got my
attention (thanks to whoever writes good comments in source code :) :


/*
 * Slow when tagged queueing is enabled. Write performance
 * steadily drops off with more and more concurrent
 * transactions.  Best sequential write performance with
 * tagged queueing turned off and write caching turned on.
 *
 * PR:  kern/10398
 * Submitted by:  Hideaki Okada hok...@isl.melco.co.jp
 * Drive:  DCAS-34330 w/ S65A firmware.
 *
 * The drive with the problem had the S65A firmware
 * revision, and has also been reported (by Stephen J.
 * Roznowski s...@home.net) for a drive with the S61A
 * firmware revision.
 *
 * Although no one has reported problems with the 2 gig
 * version of the DCAS drive, the assumption is that it
 * has the same problems as the 4 gig version.  Therefore
 * this quirk entries disables tagged queueing for all
 * DCAS drives.
 */
{ T_DIRECT, SIP_MEDIA_FIXED, "IBM", "DCAS*", "*" },
/*quirks*/0, /*mintags*/0, /*maxtags*/0

So I looked at the kern/10398 PR and got a feeling of 'deja vu',
although the original problem was on FreeBSD 3.1 so it's most likely not
that, but I thought I would mention it. The issue described is awfully
familiar: basically the SAS drive (SCSI back then) is slow on writes but
fast on reads with dd. Could be a coincidence or a ghost from the past,
who knows...


Cheers,

Karim.


Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Dieter BSD
Matthew writes:
 There is also no information in the original email as to which direction
 the I/O was being sent.

In one of the followups, Karim reported:
  # dd if=/dev/zero of=foo count=10 bs=1024000
  10+0 records in
  10+0 records out
  10240000 bytes transferred in 19.615134 secs (522046 bytes/sec)

522 KB/s is pathetic.


Re: Getting the current thread ID without a syscall?

2013-01-18 Thread Julian Elischer

On 1/15/13 4:03 PM, Trent Nelson wrote:

On Tue, Jan 15, 2013 at 02:33:41PM -0800, Ian Lepore wrote:

On Tue, 2013-01-15 at 14:29 -0800, Alfred Perlstein wrote:

On 1/15/13 1:43 PM, Konstantin Belousov wrote:

On Tue, Jan 15, 2013 at 04:35:14PM -0500, Trent Nelson wrote:

  Luckily it's for an open source project (Python), so recompilation
  isn't a big deal.  (I also check the intrinsic result versus the
  syscall result during startup to verify the same ID is returned,
  falling back to the syscall by default.)

For you, maybe. For your users, it definitely will be a problem.
And worse, the problem will be blamed on the operating system and not
to the broken application.


Anything we can do to avoid this would be best.

The reason is that we are still dealing with an optimization that perl
did, it reached inside of the opaque struct FILE to do nasty things.
Now it is very difficult for us to fix struct FILE.

We are still paying for this years later.

Any way we can make this a supported interface?

-Alfred

Re-reading the original question, I've got to ask why pthread_self()
isn't the right answer?  The requirement wasn't "I need to know what the
OS calls me", it was "I need a unique ID per thread within a process".

 The identity check is performed hundreds of times per second.  The
 overhead of (Py_MainThreadId == __readgsdword(0x48) ? A() : B()) is
 negligible -- I can't say the same for a system/function call.

 (I'm experimenting with an idea I had to parallelize Python such
  that it can exploit all cores without impeding the performance
  of normal single-threaded execution (like previous-GIL-removal
  attempts and STM).  It's very promising so far -- presuming we
  can get the current thread ID in a couple of instructions.  If
  not, single-threaded performance suffers too much.)


TLS?
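One way the TLS suggestion could look, as a minimal userland sketch (the names
are hypothetical; __thread is the compiler's thread-local storage class, so
the hot-path check is a single load from the thread's TLS area rather than a
function or system call):

/* Hypothetical sketch: cache "am I the main thread?" in TLS once, then
 * the identity check on the hot path is one thread-local load. */
static __thread int py_on_main_thread;  /* 0 in every newly created thread */

void
py_runtime_init(void)                   /* called once, from the main thread */
{
        py_on_main_thread = 1;
}

static inline int
py_main_thread_check(void)
{
        /* No syscall, no pthread_self(), no gs-segment peeking. */
        return (py_on_main_thread);
}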



 Trent.





Re: IBM blade server abysmal disk write performances

2013-01-18 Thread Adrian Chadd
On 18 January 2013 19:11, Dieter BSD dieter...@gmail.com wrote:
 Matthew writes:
 There is also no information in the original email as to which direction
 the I/O was being sent.

 In one of the followups, Karim reported:
   # dd if=/dev/zero of=foo count=10 bs=1024000
   10+0 records in
   10+0 records out
    10240000 bytes transferred in 19.615134 secs (522046 bytes/sec)

 522 KB/s is pathetic.

When this is running, use gstat and see exactly how many IOPS/sec
there are and what the average I/O size is.

Yes, 522 kbytes/sec is really pathetic, but there are a lot of potential
reasons for that.
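Something along these lines in a second terminal while the dd runs (flag per
gstat(8); -a limits the display to providers that are actually busy):

gstat -a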


adrian