Re: amrd disk performance drop after running under high load

2007-10-17 Thread Kris Kennaway

Alexey Popov wrote:

This is very unlikely, because I have 5 other video storage servers with 
the same hardware and software configuration and they are doing fine.
Clearly something is different about them, though.  If you can 
characterize exactly what that is then it will help.
I can't see any difference except the installation date. I really compared 
all the parameters and found nothing interesting.


At first glance one could say the problem is in Dell's x850 series or 
amr(4), but we run this hardware on many other projects and it works 
well. Linux also works on them.


OK but there is no evidence in what you posted so far that amr is 
involved in any way.  There is convincing evidence that it is the mbuf 
issue.

Why are you sure this is the mbuf issue?


Because that is the only problem shown in the data you posted.

 For example, if there is a real
problem with amr or VM causing disk slowdown, then when it occurs the 
network subsystem will have a different load pattern. Instead of just 
quickly sending large amounts of data, the system will have to accept a 
large number of simultaneous connections waiting for data. Can this cause 
high mbuf contention?


I'd expect to see evidence of the main problem.

And a few hours ago I received feedback from Andrzej Tobola; he has 
the same problem on FreeBSD 7 with a Promise ATA software mirror:
Well, he didn't provide any evidence yet that it is the same problem, 
so let's not become confused by feelings :)
I think he is talking about 100% disk busy while processing ~5 
transfers/sec.


% busy as reported by gstat doesn't mean what you think it does.  What 
is the I/O response time?  That's the meaningful statistic for 
evaluating I/O load.  Also, you didn't post about this.
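
As a rough illustration of the difference (just arithmetic on the numbers 
mentioned in this thread, assuming requests do not overlap - not anything 
gstat computes for you):

    /* Illustrative sketch only: the mean service time implied by %busy
     * and the transfer rate, assuming no overlapping I/O. */
    #include <stdio.h>

    int
    main(void)
    {
        double busy_fraction = 1.00; /* 100% busy over the sample interval */
        double interval_s = 1.0;     /* one-second sampling interval */
        double transfers = 5.0;      /* ~5 transfers/sec, as reported */

        double svc_ms = busy_fraction * interval_s / transfers * 1000.0;
        printf("average service time: ~%.0f ms per I/O\n", svc_ms);
        return (0);
    }

100% busy at ~5 transfers/sec works out to something like 200 ms per 
request; that response time is the number that tells you whether the disk 
is actually struggling.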


So I can conclude that FreeBSD has a long-standing bug in VM that 
could be triggered when serving a large amount of static data (much 
bigger than the memory size) at high rates. Possibly this only applies to 
large files like mp3 or video. 

It is possible; we have further work to do before we can conclude that, though.
I forgot to mention I have pmc and kgmon profiling for good and bad 
times. But I don't have enough knowledge to interpret it correctly and am not 
sure if it can help.


pmc would be useful.

Also, now I run nginx instead of lighttpd on one of the problematic 
servers. It seems to work much better - sometimes there are peaks in 
disk load, but the disk does not become very slow and network output does 
not change. The difference with nginx is that it runs in multiple 
processes, while lighttpd by default has only one process. Now I have 
configured lighttpd on another server to run with multiple workers. I'll see 
if it helps.


What else can I try?


Still waiting on the vmstat -z output.

Kris



Re: amrd disk performance drop after running under high load

2007-10-17 Thread Kris Kennaway

Kris Kennaway wrote:


What else can I try?


Still waiting on the vmstat -z output.


Also can you please obtain vmstat -i, netstat -m and 10 seconds of 
representative vmstat -w output when the problem is and is not occurring?


Kris



Re: Filesystem snapshots dog slow

2007-10-17 Thread Peter Jeremy
On 2007-Oct-16 06:54:11 -0500, Eric Anderson [EMAIL PROTECTED] wrote:
will give you a good understanding of what the issue is. Essentially, your 
disk is hammered making copies of all the cylinder groups, skipping those 
that are 'busy', and coming back to them later. On a 200Gb disk, you could 
have 1000 cylinder groups, each having to be locked, copied, unlocked, and 
then checked again for any subsequent changes.  The stalls you see are when 
there are lock contentions, or disk IO issues.  On a single disk (like your 
setup above), your snapshots will take forever since there is very little 
random IO performance available to you.

That said, there is a fair amount of scope available for improving
both the creation and deletion performance.

Firstly, it's not clear to me that having more than a few hundred CGs
has any real benefits.  There was a massive gain in moving from
(effectively) a single CG in pre-FFS to a few dozen CGs in FFS as it
was first introduced.  Modern disks are roughly 5 orders of magnitude
larger and voice-coil actuators mean that seek times are almost
independent of distance.  CG sizes are currently limited by the
requirement that the cylinder group (including cylinder group maps)
must fit into a single FS block.  Removing this restriction would
allow CGs to be much larger.

Secondly, all the I/O during both snapshot creation and deletion is
in FS-block size chunks.  Increasing the I/O size would significantly
increase the I/O performance.  Whilst it doesn't make sense to read
more than you need, there still appears to be plenty of scope to
combine writes.

Between these two items, I would expect potential performance gains
of at least 20:1.
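
As a back-of-the-envelope illustration of the scale involved (the per-CG 
size used here is an assumed typical UFS2 default, not a figure taken from 
this thread):

    /* Rough sketch: how many cylinder groups a snapshot has to visit
     * today, versus with much larger CGs.  cg_mb is an assumed typical
     * default; growth is a hypothetical increase factor. */
    #include <stdio.h>

    int
    main(void)
    {
        double fs_gb = 200.0;  /* filesystem size discussed above */
        double cg_mb = 190.0;  /* assumed typical CG size with 16k blocks */
        double growth = 20.0;  /* hypothetical CG size increase */

        double cgs_now = fs_gb * 1024.0 / cg_mb;
        double cgs_large = cgs_now / growth;

        printf("~%.0f CGs to lock/copy/recheck today\n", cgs_now);
        printf("~%.0f CGs if each were %.0fx larger\n", cgs_large, growth);
        return (0);
    }

That is on the order of a thousand lock/copy/recheck cycles per snapshot 
today versus a few dozen, which is where much of the potential gain lies.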

Note that I'm not suggesting that either of these items is trivial.

-- 
Peter Jeremy




Re: Interrupt/speed problems with 6.2 NFS server

2007-10-17 Thread Eric Anderson

Doug Clements wrote:

Hi,
   I have a new NFS server that is processing roughly 15 Mbit/s of NFS traffic,
which we recently upgraded from an older 4.10 box. It has a 3ware RAID card,
and is serving NFS out of a single em NIC to LAN clients. The machine works
great just serving NFS, but when I try to copy data from one RAID volume to
another for backups, the machine's NFS performance goes way down, and
NFS ops start taking multiple seconds to perform. The file copy itself goes
quite quickly, as would be expected. The console of the machine also starts to lag
pretty badly, and I get the 'typing through mud' effect. I use rdist6 to do
the backup.

My first impression was that I was having interrupt issues, since during the
backup, the em interfaces were pushing over 200k interrupts/sec (roughly 60%
CPU processing interrupts). So I recompiled the kernel with polling enabled
and enabled it on the NICs. The strange thing is that polling shows enabled
in ifconfig, but systat -vm still shows the same number of interrupts. I get
the same performance with polling enabled.

I'm looking for some guidance on why the machine bogs down so much during what
seems to me to be something that should barely impact machine performance at
all, and also why polling didn't seem to lower the number of interrupts
processed. The old machine was 6 years old running an old Intel RAID5, and
it handled NFS and the concurrent file copies without breaking a sweat.

My 3ware is set up as follows:
a 2 disk mirror, for the system
a 4 disk raid10, for /mnt/data1
a 4 disk raid10, for /mnt/data2

Copyright (c) 1992-2007 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.2-RELEASE-p8 #0: Thu Oct 11 10:43:22 PDT 2007
[EMAIL PROTECTED] :/usr/obj/usr/src/sys/MADONNA
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Genuine Intel(R) CPU  @ 2.66GHz (2670.65-MHz K8-class
CPU)
  Origin = "GenuineIntel"  Id = 0x6f4  Stepping = 4
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4e3bd<SSE3,RSVD2,MON,DS_CPL,VMX,EST,TM2,b9,CX16,b14,b15,b18>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  Cores per package: 2
real memory  = 4831838208 (4608 MB)
avail memory = 4125257728 (3934 MB)
ACPI APIC Table: INTEL  S5000PSL
ioapic0 Version 2.0 irqs 0-23 on motherboard
ioapic1 Version 2.0 irqs 24-47 on motherboard
lapic0: Forcing LINT1 to edge trigger
kbd1 at kbdmux0
ath_hal: 0.9.17.2 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
acpi0: INTEL S5000PSL on motherboard
acpi0: Power Button (fixed)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: 24-bit timer at 3.579545MHz port 0x408-0x40b on acpi0
cpu0: ACPI CPU on acpi0
acpi_throttle0: ACPI CPU Throttling on cpu0
acpi_button0: Sleep Button on acpi0
acpi_button1: Power Button on acpi0
pcib0: ACPI Host-PCI bridge port 0xca2,0xca3,0xcf8-0xcff on acpi0
pci0: ACPI PCI bus on pcib0
pcib1: ACPI PCI-PCI bridge at device 2.0 on pci0
pci1: ACPI PCI bus on pcib1
pcib2: ACPI PCI-PCI bridge irq 16 at device 0.0 on pci1
pci2: ACPI PCI bus on pcib2
pcib3: ACPI PCI-PCI bridge irq 16 at device 0.0 on pci2
pci3: ACPI PCI bus on pcib3
pcib4: ACPI PCI-PCI bridge irq 17 at device 1.0 on pci2
pci4: ACPI PCI bus on pcib4
pcib5: ACPI PCI-PCI bridge irq 18 at device 2.0 on pci2
pci5: ACPI PCI bus on pcib5
em0: Intel(R) PRO/1000 Network Connection Version - 6.2.9 port
0x3020-0x303f mem 0xf882-0xf883,0xf840-0xf87f irq 18 at
device 0.0 on pci5
em0: Ethernet address: 00:15:17:21:bf:30
em1: Intel(R) PRO/1000 Network Connection Version - 6.2.9 port
0x3000-0x301f mem 0xf880-0xf881,0xf800-0xf83f irq 19 at
device 0.1 on pci5
em1: Ethernet address: 00:15:17:21:bf:31
pcib6: ACPI PCI-PCI bridge at device 0.3 on pci1
pci6: ACPI PCI bus on pcib6
3ware device driver for 9000 series storage controllers, version:
3.60.02.012
twa0: 3ware 9000 series Storage Controller port 0x2000-0x203f mem
0xfa00-0xfbff,0xf890-0xf8900fff irq 26 at device 2.0 on pci6
twa0: [GIANT-LOCKED]
twa0: INFO: (0x15: 0x1300): Controller details:: Model 9550SX-12, 12 ports,
Firmware FE9X 3.08.00.004, BIOS BE9X 3.08.00.002
pcib7: PCI-PCI bridge at device 3.0 on pci0
pci7: PCI bus on pcib7
pcib8: ACPI PCI-PCI bridge at device 4.0 on pci0
pci8: ACPI PCI bus on pcib8
pcib9: ACPI PCI-PCI bridge at device 5.0 on pci0
pci9: ACPI PCI bus on pcib9
pcib10: ACPI PCI-PCI bridge at device 6.0 on pci0
pci10: ACPI PCI bus on pcib10
pcib11: PCI-PCI bridge at device 7.0 on pci0
pci11: PCI bus on pcib11
pci0: base peripheral at device 8.0 (no driver attached)
pcib12: ACPI PCI-PCI bridge irq 16 at device 28.0 on pci0
pci12: ACPI PCI bus on pcib12
uhci0: UHCI (generic) USB controller port 0x4080-0x409f irq 23 at device
29.0 on pci0
uhci0: 

Re: Filesystem snapshots dog slow

2007-10-17 Thread Eric Anderson

Kostik Belousov wrote:

On Wed, Oct 17, 2007 at 08:00:03PM +1000, Peter Jeremy wrote:

On 2007-Oct-16 06:54:11 -0500, Eric Anderson [EMAIL PROTECTED] wrote:
will give you a good understanding of what the issue is. Essentially, your 
disk is hammered making copies of all the cylinder groups, skipping those 
that are 'busy', and coming back to them later. On a 200Gb disk, you could 
have 1000 cylinder groups, each having to be locked, copied, unlocked, and 
then checked again for any subsequent changes.  The stalls you see are when 
there are lock contentions, or disk IO issues.  On a single disk (like your 
setup above), your snapshots will take forever since there is very little 
random IO performance available to you.

That said, there is a fair amount of scope available for improving
both the creation and deletion performance.

Firstly, it's not clear to me that having more than a few hundred CGs
has any real benefits.  There was a massive gain in moving from
(effectively) a single CG in pre-FFS to a few dozen CGs in FFS as it
was first introduced.  Modern disks are roughly 5 orders of magnitude
larger and voice-coil actuators mean that seek times are almost
independent of distance.  CG sizes are currently limited by the
requirement that the cylinder group (including cylinder group maps)
must fit into a single FS block.  Removing this restriction would
allow CGs to be much larger.

Secondly, all the I/O during both snapshot creation and deletion is
in FS-block size chunks.  Increasing the I/O size would significantly
increase the I/O performance.  Whilst it doesn't make sense to read
more than you need, there still appears to be plenty of scope to
combine writes.

Between these two items, I would expect potential performance gains
of at least 20:1.

Note that I'm not suggesting that either of these items is trivial.

This is, unfortunately, quite true. Allowing non-atomic updates of the
cg block means a lot of complications in the softupdate code, IMHO.



I agree with all the above.  I think it has not been done because of 
exactly what Kostik says.  I really think that the CG max size is *way* 
too small now, and should be about 10-50 times larger, but performance 
tests would need to be run.


Eric



Re: amrd disk performance drop after running under high load

2007-10-17 Thread Alexey Popov

Hi.

Kris Kennaway wrote:
After some time running under high load, disk performance becomes 
extremely poor. During those periods 'systat -vm 1' shows something like 
this:

This web service is similar to YouTube. This server is a video store. I
have around 200G of *.flv (Flash video) files on the server.
I run lighttpd as the web server. Disk load is usually around 50%, network 
output 100Mbit/s, 100 simultaneous connections. CPU is mostly idle.
This is very unlikely, because I have 5 other video storage servers with 
the same hardware and software configuration and they are doing fine.
Clearly something is different about them, though.  If you can 
characterize exactly what that is then it will help.
I can't see any difference except the installation date. I really compared 
all the parameters and found nothing interesting.


At first glance one could say the problem is in Dell's x850 series or 
amr(4), but we run this hardware on many other projects and it works 
well. Linux also works on them.


OK but there is no evidence in what you posted so far that amr is 
involved in any way.  There is convincing evidence that it is the mbuf 
issue.
Why are you sure this is the mbuf issue? For example, if there is a real 
problem with amr or VM causing disk slowdown, then when it occurs the 
network subsystem will have a different load pattern. Instead of just 
quickly sending large amounts of data, the system will have to accept a 
large number of simultaneous connections waiting for data. Can this cause 
high mbuf contention?




And a few hours ago I received feedback from Andrzej Tobola; he has the 
same problem on FreeBSD 7 with a Promise ATA software mirror:
Well, he didn't provide any evidence yet that it is the same problem, so 
let's not become confused by feelings :)
I think he is talking about 100% disk busy while processing ~5 
transfers/sec.


So I can conclude that FreeBSD has a long-standing bug in VM that 
could be triggered when serving a large amount of static data (much 
bigger than the memory size) at high rates. Possibly this only applies to 
large files like mp3 or video. 

It is possible; we have further work to do before we can conclude that, though.
I forgot to mention I have pmc and kgmon profiling for good and bad 
times. But I don't have enough knowledge to interpret it correctly and am not 
sure if it can help.


Also, now I run nginx instead of lighttpd on one of the problematic 
servers. It seems to work much better - sometimes there are peaks in 
disk load, but the disk does not become very slow and network output does 
not change. The difference with nginx is that it runs in multiple 
processes, while lighttpd by default has only one process. Now I have 
configured lighttpd on another server to run with multiple workers. I'll see 
if it helps.


What else can I try?

With best regards,
Alexey Popov


Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Ulf Lilleengen
On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote:
  Hi,
  is anybody working on the `Pluggable Disk Scheduler Project' from
  the ideas page?
 I've been kicking the idea around in my head, but I'm probably newer to
 everything involved than you are, so feel free to pick it up. If you want,
 we can toss some ideas and code to each other, though I don't really
 have anything on the latter.
 
 [...]
  After reading [1], [2] and its follow-ups the main problems that
  need to be addressed seem to be:
  
  o is working on disk scheduling worth it at all?
 Probably, one of the main applications would be to make the background
 fsck a little more well-behaved.
I agree, as I said before, the ability to give I/O priorities is probably one
of the most important things.
 
  o Where is the right place (in GEOM) for a disk scheduler?
[...]
 
  o How can anticipation be introduced into the GEOM framework?
 I wouldn't focus on just anticipation, but also other types of
 schedulers (I/O scheduling influenced by nice value?)
 
  o What can be an interface for disk schedulers?
 good question, but geom seems a good start ;)
 
  o How to deal with devices that handle multiple request per time?
 Bad news first: this is most disks out there, in a way ;)
 SCSI has tagged queuing, ATA has native command queuing or
 whatever the ATA people came up with over their morning coffee today.
 I'll mention a bit more about this further down.
 
  o How to deal with metadata requests and other VFS issues?
 Like any other disk request, though for priority-respecting
 schedulers this may get rather interesting.
 
 [...]
  The main idea is to allow the scheduler to enqueue the requests
  having only one (other small fixed numbers can be better on some
  hardware) outstanding request and to pass new requests to its
  provider only after the service of the previous one ended.
[...]
  - servers where anticipatory performs better than elevator
  - realtime environments that need a scheduler fitting their needs
  - the background fsck, if someone implements a priority scheduler

Apache is actually a good candidate according to the old anticipatory design
document (not sure of its relevance today, but...)

Over to a more general view of its architecture:

When I looked at this project for the first time, I was under the impression
that this would be best done in a GEOM class.

However, I think the approach that was taken in the Hybrid project is better 
for the following reasons:

- It makes it possible to be used by _both_ GEOM classes and device drivers (which 
  might use some other scheduler type?).
- It does not remove any configurability, since changes etc. can be done by the user 
  with sysctl.
- It could make it possible for a GEOM class to decide for itself which scheduler it
  wants to use (most GEOM classes use the standard bioq_disksort interface in
  disk_subr.c).
- The ability to stack a GEOM class with a scheduler could easily be emulated
  by creating a GEOM class to utilize the disksched framework.

All in all, I just think this approach gives more flexibility than putting it in
a GEOM class that has to be added manually by a user.
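
To make the "pluggable" part concrete, such a framework could expose a hook 
table roughly like the sketch below (hypothetical names and layout, not the 
actual Hybrid or bioq_* API):

    /* Hypothetical sketch of a pluggable disk scheduler hook table; the
     * names and layout are illustrative, not the Hybrid project's API. */
    struct bio;                       /* the request type, as in sys/bio.h */

    struct disksched_ops {
        const char *name;             /* e.g. "elevator", "anticipatory" */
        void *(*ds_init)(void);       /* allocate per-queue private state */
        void  (*ds_fini)(void *priv);
        void  (*ds_insert)(void *priv, struct bio *bp);  /* enqueue request */
        struct bio *(*ds_next)(void *priv); /* pick next request; NULL = idle */
        void  (*ds_done)(void *priv, struct bio *bp);    /* completion hook */
    };

A GEOM class or a driver could then select one of these tables by name, for 
example via the sysctl knobs mentioned above.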

Just my thought on this.

Also, I got my test box up again today, and will be trying your patch as soon
as I've upgraded it to CURRENT, Fabio.
-- 
Ulf Lilleengen


Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Fabio Checconi
 From: Karsten Behrmann [EMAIL PROTECTED]
 Date: Tue, Oct 16, 2007 04:10:37PM +0200

  Hi,
  is anybody working on the `Pluggable Disk Scheduler Project' from
  the ideas page?
 I've been kicking the idea around in my head, but I'm probably newer to
 everything involved than you are, so feel free to pick it up. If you want,
 we can toss some ideas and code to each other, though I don't really
 have anything on the latter.

Thank you for your answer, I'd really like to work/discuss with you
and anyone else interested in this project.


  o Where is the right place (in GEOM) for a disk scheduler?
 I have spent some time at eurobsdcon talking to Kirk and phk about
 this, and the result was that I now know strong proponents both for
 putting it into the disk drivers and for putting it into geom ;-)
 
 Personally, I would put it into geom. I'll go into more detail on
 this later, but basically, geom seems a better fit for high-level
 code than a device driver, and if done properly performance penalties
 should be negligible.
 

I'm quite interested in the arguments for the opposite solution too;
I started from GEOM because it seemed to be a) what was proposed/
requested on the ideas page, and b) cleaner, at least for a prototype.
I also wanted to start with some code to evaluate the performance
penalties of that approach.

I am a little bit scared by the prospect of changing the
queueing mechanisms that drivers use, as this kind of modification
can be difficult to write, test and maintain, but I'd really like
to know what people with experience in those kernel areas think
about the possibility of doing more complex I/O scheduling, with
some sort of unified interface, at this level.

As a side note, I have not posted any performance numbers yet because
at the moment I only have access to old ATA drives that would not
give significant results.


  o How can anticipation be introduced into the GEOM framework?
 I wouldn't focus on just anticipation, but also other types of
 schedulers (I/O scheduling influenced by nice value?)
 

That would be interesting, especially for the background fsck case.
I think that some kind of fair sharing approach should be used; as
you say below, a priority-driven scheduler can have interactions with
the VFS that are difficult to track. (This problem was pointed out
also in one of the follow-ups to [1].)


  o How to deal with metadata requests and other VFS issues?
 Like any other disk request, though for priority-respecting
 schedulers this may get rather interesting.
 
 [...]
  The main idea is to allow the scheduler to enqueue the requests
  having only one (other small fixed numbers can be better on some
  hardware) outstanding request and to pass new requests to its
  provider only after the service of the previous one ended.
 You'll want to queue at least two requests at once. The reason for
 this is performance:
 Currently, drivers queue their own I/O. This means that as soon
 as a request completes (on devices that don't have in-device
 queues), they can fairly quickly grab a new request from their
 internal queue and push it back to the device from the interrupt
 handler or some other fast method.

Wouldn't that require, to be sustainable (unless you want a fast
dispatch every two requests), that the driver queue always has a
length of two or more?  In that case, to refill the driver queue,
you ask the upper scheduler to dispatch a new request every time a
request is taken from the driver queue and sent to the disk, or at
any other moment before the request under service is completed.
That way you cannot have an anticipation mechanism, because the
next request you'll want to dispatch from the upper scheduler has
not yet been issued (it will be issued only after the one being
served completes and the userspace application resumes.)
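
To spell out the mechanism at stake here: anticipation means that after a
synchronous request completes, the scheduler deliberately keeps the disk
idle for a short window, betting that the same process will immediately
issue the next, nearby request.  A stripped-down sketch of that decision
(made-up types, illustrative only):

    /* Stripped-down sketch of the anticipation decision; made-up types,
     * illustrative only. */
    #include <stdbool.h>
    #include <stddef.h>

    struct req { int owner_pid; };  /* stand-in for the real request type */

    struct antic_state {
        int  last_pid;   /* process whose sync request just completed */
        bool waiting;    /* are we currently holding the disk idle? */
    };

    /* Called when a new request arrives while we are anticipating:
     * dispatch it only if it comes from the process we are waiting for,
     * otherwise keep idling until a timeout gives up the bet. */
    static struct req *
    antic_on_arrival(struct antic_state *as, struct req *rp)
    {
        if (as->waiting && rp->owner_pid == as->last_pid) {
            as->waiting = false;
            return (rp);             /* the bet paid off: dispatch now */
        }
        return (NULL);               /* keep waiting */
    }

    int
    main(void)
    {
        /* trivial usage example */
        struct antic_state as = { .last_pid = 42, .waiting = true };
        struct req r = { .owner_pid = 42 };
        return (antic_on_arrival(&as, &r) != NULL ? 0 : 1);
    }

Which is exactly why a pre-filled driver queue gets in the way: the request
you would like to anticipate has not even been issued yet.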


 Having the device idle while the response percolates up the geom
 stack and a new request down will likely be rather wasteful.

I completely agree on that.  I've only done it this way because it
was the least intrusive option I could find.  What other, more
efficient alternatives could there be?  (Obviously without subverting
any of the existing interfaces, and still allowing the anticipation of
requests.)


 For disks with queuing, I'd recommend trying to keep the queue
 reasonably full (unless the queuing strategy says otherwise),
 for disks without queuing I'd say we want to push at least one
 more request down. Personally, I think the sanest design would
 be to have device drivers return a temporary I/O error along
 the lines of EAGAIN, meaning their queue is full.
 

This can easily be done for asynchronous requests.  The trouble
arises when dealing with synchronous requests... i.e., if you dispatch
more than one synchronous request you are serving more than one
process, and you have long seek times between the requests you have
dispatched.  I think we should do (I'll do so as soon as I have access
to some realistic test system) some 

Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Fabio Checconi
 From: Ulf Lilleengen [EMAIL PROTECTED]
 Date: Wed, Oct 17, 2007 01:07:15PM +0200

 On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote:
 Over to a more general view of it's architecture:
 
 When I looked at this project for the first time, I was under the impression
 that this would be best done in a GEOM class.
 
 However, I think the approach that was taken in the Hybrid project is better 

Ok.  I think that such a solution requires a lot more effort on the
design and coding sides, as it requires the modification of the
drivers and can bring us problems with locking and with the queueing
assumptions that may vary on a per-driver basis.

Maybe I don't have enough experience/knowledge of the driver subsystem,
but I would not remove the queueing that is done now by the drivers
(think of ata freezepoints); instead I'd like to try to grab
the requests before they get to the driver (e.g., in/before their
d_strategy call) and have some sort of pull mechanism when requests
complete (I still don't have any (serious) idea on that; I fear that
the right place to do that, for locking issues and so on, may be
driver dependent.)  Any ideas on that?  Which drivers would be good
starting points for trying to write down some code?


 Also, I got my test-box up again today, and will be trying your patch as soon
 as I've upgraded it to CURRENT Fabio.

Thank you very much!  Please consider that my primary concern with
the patch was its interface; the algorithm is just an example (it
should give an idea of the performance loss due to the mechanism's
overhead with async requests, and of some improvement on greedy sync
loads.)



Re: Filesystem snapshots dog slow

2007-10-17 Thread Kostik Belousov
On Wed, Oct 17, 2007 at 08:00:03PM +1000, Peter Jeremy wrote:
 On 2007-Oct-16 06:54:11 -0500, Eric Anderson [EMAIL PROTECTED] wrote:
 will give you a good understanding of what the issue is. Essentially, your 
 disk is hammered making copies of all the cylinder groups, skipping those 
 that are 'busy', and coming back to them later. On a 200Gb disk, you could 
 have 1000 cylinder groups, each having to be locked, copied, unlocked, and 
 then checked again for any subsequent changes.  The stalls you see are when 
 there are lock contentions, or disk IO issues.  On a single disk (like your 
 setup above), your snapshots will take forever since there is very little 
 random IO performance available to you.
 
 That said, there is a fair amount of scope available for improving
 both the creation and deletion performance.
 
 Firstly, it's not clear to me that having more than a few hundred CGs
 has any real benefits.  There was a massive gain in moving from
 (effectively) a single CG in pre-FFS to a few dozen CGs in FFS as it
 was first introduced.  Modern disks are roughly 5 orders of magnitude
 larger and voice-coil actuators mean that seek times are almost
 independent of distance.  CG sizes are currently limited by the
 requirement that the cylinder group (including cylinder group maps)
 must fit into a single FS block.  Removing this restriction would
 allow CGs to be much larger.
 
 Secondly, all the I/O during both snapshot creation and deletion is
 in FS-block size chunks.  Increasing the I/O size would significantly
 increase the I/O performance.  Whilst it doesn't make sense to read
 more than you need, there still appears to be plenty of scope to
 combine writes.
 
 Between these two items, I would expect potential performance gains
 of at least 20:1.
 
 Note that I'm not suggesting that either of these items is trivial.
This is, unfortunately, quite true. Allowing non-atomic updates of the
cg block means a lot of complications in the softupdate code, IMHO.




Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Ulf Lilleengen
On ons, okt 17, 2007 at 02:19:07 +0200, Fabio Checconi wrote:
  From: Ulf Lilleengen [EMAIL PROTECTED]
  Date: Wed, Oct 17, 2007 01:07:15PM +0200
 
  On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote:
  Over to a more general view of it's architecture:
  
  When I looked at this project for the first time, I was under the impression
  that this would be best done in a GEOM class.
  
  However, I think the approach that was taken in the Hybrid project is 
  better 
 
 Ok.  I think that such a solution requires a lot more effort on the
 design and coding sides, as it requires the modification of the
 drivers and can bring us problems with locking and with the queueing
 assumptions that may vary on a per-driver basis.
 
I completely agree with the issue of converting device drivers, but at least
it will be an _optional_ possibility (having different scheduler plugins
could make this possible). One does not necessarily need to convert
the drivers. 
 Maybe I don't have enough experience/knowledge of the driver subsystem,
 but I would not remove the queueing that is done now by the drivers
 (think of ata freezepoints); instead I'd like to try to grab
 the requests before they get to the driver (e.g., in/before their
 d_strategy call) and have some sort of pull mechanism when requests
 complete (I still don't have any (serious) idea on that; I fear that
 the right place to do that, for locking issues and so on, may be
 driver dependent.)  Any ideas on that?  Which drivers would be good
 starting points for trying to write down some code?
 
If you look at it, Hybrid is just a generalization of the existing
bioq_* API already defined. And this API is used by GEOM classes _before_
device drivers get the requests AFAIK. 

For a simple example of a driver, the md driver might be a good place to
look. Note that I have little experience and knowledge of the driver
subsystem myself.

Also note (from the Hybrid page):
* we could not provide support for non work-conserving schedulers, due to a
  couple of reasons:
 1. the assumption, in some drivers, that bioq_disksort() will make
 requests immediately available (so a subsequent bioq_first() will
 not return NULL).
 2. the fact that there is no bioq_lock()/bioq_unlock(), so the
 scheduler does not have a safe way to generate requests for a given
 queue. 

This certainly argues for having this in the GEOM layer, but perhaps it's
possible to change the assumptions made in some drivers? The locking issue
should perhaps be better planned though, and an audit of the driver disksort
code is necessary.
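
To spell out reason 1 above: a lot of driver start paths follow roughly the
pattern sketched below (simplified, not copied from any particular driver),
and they silently assume that an insert makes a request immediately
available, which a non work-conserving scheduler would break:

    /* Simplified sketch of a common driver dequeue pattern.  The bioq_*()
     * prototypes are repeated here so the fragment stands alone; the
     * pattern itself is generic, not from any specific driver. */
    #include <stddef.h>

    struct bio;
    struct bio_queue_head;

    void        bioq_disksort(struct bio_queue_head *q, struct bio *bp);
    struct bio *bioq_first(struct bio_queue_head *q);
    void        bioq_remove(struct bio_queue_head *q, struct bio *bp);
    void        hw_start(struct bio *bp);  /* stand-in for hardware submit */

    static void
    drv_strategy(struct bio_queue_head *q, struct bio *bp)
    {
        struct bio *next;

        bioq_disksort(q, bp);
        /* Many drivers expect a request to be available right here; a
         * scheduler that holds requests back would return NULL instead. */
        next = bioq_first(q);
        if (next != NULL) {
            bioq_remove(q, next);
            hw_start(next);
        }
    }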

Also:
* as said, the ATA driver in 6.x/7.x moves the disksort one layer below the
  one we are working at, so this particular work won't help on ATA-based 6.x
  machines.
  We should figure out how to address this, because the work done at that
  layer is mostly a replica of the bioq_*() API. 

So, I see this can get a bit messy given that the ATA driver does its
disksort on its own, but perhaps it would be possible to fix this by
changing the general ATA driver to use its own pluggable scheduler.

Anyway, I shouldn't demand that you do this, especially since I don't have
any code or anything to show, and because you decide what you want to do.
However, I'd hate to see the Hybrid effort go to waste :) I was hoping some
of the authors of the project would reply with their thoughts, so I CC'ed
them. 

-- 
Ulf Lilleengen


Re: Inner workings of turnstiles and sleepqueues

2007-10-17 Thread John Baldwin
On Tuesday 16 October 2007 05:41:18 am Ed Schouten wrote:
 Hello,
 
 I asked the following question on questions@, but as requested, I'll
 forward this question to this list, because of its technical nature.
 
 - Forwarded message from Ed Schouten [EMAIL PROTECTED] -
  Date: Mon, 15 Oct 2007 23:13:01 +0200
  From: Ed Schouten [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  Subject: Inner workings of turnstiles and sleepqueues
  
  Hello,
  
  For some reason, I want to understand how the queueing of blocked
  threads in the kernel works when waiting for a lock, which, if I
  understand correctly, is done by the turnstiles and sleepqueues. I'm the
  proud owner of The Design and Implementation of the FreeBSD Operating
  System book, but for some reason, I can't find anything about it in the
  book.
  
  Is there a way to obtain information about how they work? I already read
  the source somewhat, but that doesn't seem like an ideal solution, in my
  opinion.

The best option right now is to read the code.  There are some comments in
both the headers and implementation.

-- 
John Baldwin


Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Luigi Rizzo
On Wed, Oct 17, 2007 at 03:09:35PM +0200, Ulf Lilleengen wrote:

... discussion on Hybrid vs. GEOM as a suitable location for
... pluggable disk schedulers

 However, I'd hate to see the Hybrid effort go to waste :) I was hoping some
 of the authors of the project would reply with their thoughts, so I CC'ed
 them. 

we are in good contact with Fabio and i am monitoring the discussion,
don't worry.

cheers
luigi


Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Fabio Checconi
 From: Ulf Lilleengen [EMAIL PROTECTED]
 Date: Wed, Oct 17, 2007 03:09:35PM +0200

 On ons, okt 17, 2007 at 02:19:07 +0200, Fabio Checconi wrote:
  Maybe I've not enough experience/knowledge of the driver subsystem,
[...]
 If you look at it, Hybrid is just a generalization of the existing
 bioq_* API already defined. And this API is used by GEOM classes _before_
 device drivers get the requests AFAIK. 
 

I looked at the Hybrid code, but I don't think that the bioq_*
family of calls is the right place to start, because of the problems
experienced during the Hybrid development with locking/anticipation,
and because you can have the same request passing through multiple
bioqs on its path to the device (e.g., two stacked geoms using
two different bioqs and then a device driver using bioq_* to organize
its queue, or geoms using more than one bioq, like raid3; I think
the complexity can become unmanageable.)  One could even think of
configuring each single bioq in the system, but things can get very
complex that way.


 For a simple example on a driver, the md-driver might be a good place to
 look. Note that I have little experience and knowledge of the driver
 subsystem myself.
 

I'll take a look, thanks.


 Also note (from the Hybrid page):
 * we could not provide support for non work-conserving schedulers, due to a
[...]
 
 This certainly argues for having this in the GEOM layer, but perhaps it's
 possible to change the assumptions made in some drivers? The locking issue
 should perhaps be better planned though, and an audit of the driver disksort
 code is necessary.
 

I need some more time to think about that :)


 Also:
 * as said, the ATA driver in 6.x/7.x moves the disksort one layer below the
   one we are working at, so this particular work won't help on ATA-based 6.x
   machines.
   We should figure out how to address this, because the work done at that
   layer is mostly a replica of the bioq_*() API. 
 
 So, I see this can get a bit messy given that the ATA driver does its
 disksort on its own, but perhaps it would be possible to fix this by
 changing the general ATA driver to use its own pluggable scheduler.
 
 Anyway, I shouldn't demand that you do this, especially since I don't have
 any code or anything to show, and because you decide what you want to do.

I still cannot say if a GEOM scheduler is better than a scheduler
put at a lower level, or if the bioq_* interface is better than any
other alternative, so your suggestions are welcome.  Moreover, I'd
really like to discuss/work together, or at least do things with
some agreement on them.  If I have the time to experiment with
more than one solution I'll be happy to do that.


 However, I'd hate to see the Hybrid effort go to waste :) I was hoping some
 of the authors of the project would reply with their thoughts, so I CC'ed
 them. 

Well, the work done on Hybrid also had interesting aspects on
the algorithm side... but that's another story...



Re: Interrupt/speed problems with 6.2 NFS server

2007-10-17 Thread Doug Clements

 Can you also send the output of ps -auxl?

 Also - do you notice this performance drop when running something like
 one of the network performance tools? I'd like to isolate the disk
 activity from the network activity for a clean test..


I tested this with iperf, and while I did see some NFS performance
degradation, it did not bog the machine down or delay the terminal in the same
way. NFS requests were still processed in an acceptable fashion. It was
still responsive to commands and ran processes just fine. I would have to
say it performed as expected during the iperf tests (in which I got about
85 Mbit/sec, which is also to be expected). Interrupts went down to about
2000/sec on em, but the machine did not hang.

Something I've noticed is that running 'systat -vm' seems to be part of the
problem. If I run the file copy by itself with rdist, it's fast and runs OK.
If I run it with systat -vm going, that is when the interrupts jump way up
and the machine starts to lag badly. Noticing this, I tried running
'sysctl -a' during the file copy, thinking there was some problem with
polling the kernel for certain statistics. Sure enough, sysctl -a stalls at
two spots: once right after kern.random.sys.harvest.swi: 0 and once again
after debug.hashstat.nchash: 131072 61777 6 4713. While it is stalled there
(for a couple of seconds each time) the machine is totally hung. Maybe this
is a statistics polling issue? Maybe the machine is delayed just long enough
in systat -vm to make the NFS clients retry, causing a storm of interrupts?

Other systat modes do not seem to cause the same problem (pigs, icmp,
ifstat).

I do not think the ps or systat output is very accurate, since I can't get
them to run when the machine is hung up. I type in the command, but it does
not run until the machine springs back to life. I'm not sure how this will
affect measurements.

http://toric.loungenet.org/~doug/sysctl-a
http://toric.loungenet.org/~doug/psauxl
http://toric.loungenet.org/~doug/systat-vm

My real confusion lies in why there are still em interrupts at all, with
polling on.

Thanks!

--Doug


Re: amrd disk performance drop after running under high load

2007-10-17 Thread Alexey Popov

Hi

Kris Kennaway wrote:
And a few hours ago I received feedback from Andrzej Tobola; he has 
the same problem on FreeBSD 7 with a Promise ATA software mirror:
Well, he didn't provide any evidence yet that it is the same problem, 
so let's not become confused by feelings :)
I think he is talking about 100% disk busy while processing ~5 
transfers/sec.


% busy as reported by gstat doesn't mean what you think it does.  What 
is the I/O response time?  That's the meaningful statistic for 
evaluating I/O load.  Also, you didn't post about this.
At the problematic times the disk felt very slow, all processes 
were in the disk-read state, and vmstat confirmed it with the % numbers.


So I can conclude that FreeBSD has a long-standing bug in VM that 
could be triggered when serving a large amount of static data (much 
bigger than the memory size) at high rates. Possibly this only applies 
to large files like mp3 or video. 

It is possible; we have further work to do before we can conclude that, though.
I forgot to mention I have pmc and kgmon profiling for good and bad 
times. But I don't have enough knowledge to interpret it correctly and am not 
sure if it can help.

pmc would be useful.
Unfortunately I've lost the pmc profiling results. I'll try to collect them 
again later. See the vmstat output in the attachment (vmstat -z; netstat -m; 
vmstat -i; vmstat -w 1 | head -11).


Also you can see kgmon profiling results at: http://83.167.98.162/gprof/

With best regards,
Alexey Popov

ITEM SIZE LIMIT  USED  FREE  REQUESTS  FAILURES

UMA Kegs: 240,0,   71,4,   71,0
UMA Zones:376,0,   71,9,   71,0
UMA Slabs:128,0, 1011,   62,   243081,0
UMA RCntSlabs:128,0,  361, 1205,   363320,0
UMA Hash: 256,0,4,   11,7,0
16 Bucket:152,0,   45,   30,   72,0
32 Bucket:280,0,   25,   45,   69,0
64 Bucket:536,0,   17,   25,   55,   53
128 Bucket:  1048,0,  287,   88, 1200,95423
VM OBJECT:224,0, 5536,23228,  7675004,0
MAP:  352,0,7,   15,7,0
KMAP ENTRY:   112,90222,  283, 1037,  1207524,0
MAP ENTRY:112,0, 1396,  419, 72221561,0
PV ENTRY:  48,  2244600,17835,30261, 768591673,0
DP fakepg:120,0,0,   31,   10,0
mt_zone: 1024,0,  170,6,  170,0
16:16,0, 3578, 2470, 745206870,0
32:32,0, 1273,  343,  1750850,0
64:64,0, 6147, 1693, 487691440,0
128:  128,0, 4659,  387,  1464251,0
256:  256,0,  596, 2539,  7208469,0
512:  512,0,  608,  253,   791295,0
1024:1024,0,   49,  239,82867,0
2048:2048,0,   27,  295,   115362,0
4096:4096,0,  240,  278,   564659,0
Files:120,0,  544,  324, 263880246,0
TURNSTILE:104,0,  181,   83,  307,0
PROC: 856,0,   82,   82,   308409,0
THREAD:   608,0,  169,   11,24468,0
KSEGRP:   136,0,  165,   69,  165,0
UPCALL:88,0,3,   73,3,0
SLEEPQUEUE:64,0,  181,   99,  307,0
VMSPACE:  544,0,   35,   77,   310929,0
mbuf_packet:  256,0,  368,  115, 1331807039,0
mbuf: 256,0, 2016, 2331, 5433003167,0
mbuf_cluster:2048,32768,  483,  239, 1236143964,0
mbuf_jumbo_pagesize: 4096,0,0,0,0,0
mbuf_jumbo_9k:   9216,0,0,0,0,0
mbuf_jumbo_16k: 16384,0,0,0,0,0
ACL UMA zone: 388,0,0,0,0,0
g_bio:216,0,4,  410, 48175991,0
ata_request:  336,0,0,   22,   24,0
ata_composite:376,0,0,0,0,0
VNODE:

Re: Useful tools missing from /rescue

2007-10-17 Thread Yar Tikhiy
On Mon, Oct 15, 2007 at 10:38:26AM -0700, David O'Brien wrote:
 On Sat, Oct 13, 2007 at 10:01:39AM +0400, Yar Tikhiy wrote:
  On Wed, Oct 03, 2007 at 07:23:44PM -0700, David O'Brien wrote:
   I also don't see the need for pgrep - I think needing that says your
   system is running multiuser pretty well.
  
  First of all, I'd like to point out that /rescue doesn't need to
  be as minimal as /stand used to.  Now, /rescue is a compact yet
  versatile set of essential tools that can help in any difficult
  situation when /*bin:/usr/*bin are unusable for some reason, not
  only in restoring a broken system while in single-user mode.  A
 ..
  As for pgrep+pkill, it can come handy if one has screwed up his
  live system and wants to recover it without dropping the system to
  single-user.
 
 But if we take this just a little bit farther then why don't we go back
 to a static /[s]bin except for the few things one might need LDAP, etc..
 for?  That is, what's the purpose in continuing to duplicate /[s]bin
 into /rescue?  /rescue should be just enough to reasonably get a system
  whose shared libs are messed up working again.

Note that /rescue includes the most essential tools from /usr/[s]bin,
too.  Irrespective of its initial purpose, I regard /rescue as an
emergency toolset left aside.  In particular, it's good to know
it's there when you experiment with a live remote system.

  A valid objection to this point is that pgrep's job
  can be done with a combination of ps(1) and sed(1), so it's just a
  matter of convenience.
 
 I guess I'm still having trouble understanding why one would need 'ps'
 to fix a shared libs issue.  Now is a reason to keep adding stuff to

IMHO it isn't only shared libs issues that /rescue can help with.

 /rescue.  Also why one would be running 'ps -aux', which is the only way
 I can think of to get more than one screen of output if a system is in
 trouble.

Imagine that you've rm'ed /usr by accident in a remote shell session.
With enough tools in /rescue (which doesn't take lots of tools,)
you can stop sensitive daemons, find the backup, restore from it,
and get a functional system again without a reboot.  No doubt, some
tools just make the task easier by providing typical command-line
idioms.

I don't mean I'm so reckless that I need to restore my /usr often,
but the 3-4 megabytes occupied by /rescue are a terribly low price
today for being able to shoot different parts of one's foot without
necessarily hitting the bone.

  The price for it in terms of disk space is next to nothing, and there
  are quite useless space hogs in /rescue already (see below on
  /rescue/vi.)
 
 Considering how few people are skilled in ed(1) these days, we have
 little choice but include vi.

Of course, there should be /rescue/vi, and I have an idea on how
to remove its dependence on /usr in a more or less elegant way.  I
mentioned the not-so-functional /rescue/vi here just to show that
we can tolerate certain space waste in /rescue.

  I won't speak for everyone, but I really like to use fancy shell
  commands, particularly during hard times: loops, pipelines, etc.
  So I don't have to enter many commands for a single task or browse
 
 I guess I'm not creative enough in the ways I've screwed up my systems
 and needed tools from /rescue. 8-)

Just try to installworld FreeBSD/amd64 over a running FreeBSD/i386. ;-)

   I don't see the purpose of chown - if you have to fall back to /rescue
   you're user 'root' - and you're trying to fix enough so you can use
   standard /*lib  /*bin
 ..
  Having /rescue/chown is just a matter of completeness of the ch*
  subset of /rescue tools because chown's job can't be done by any
  other stock tools.  If /rescue is complete enough, one can find
  more applications for it.  E.g., the loader, a kernel, and /rescue
 
 /rescue wasn't intended to be well orthogonal.  /rescue was part of the
 cornerstone of the deal to switch to shared /[s]bin.

But it doesn't confine us to the corner forever.  Having an emergency
toolset independent of the rest of the system is good in any case.
I bet people will experiment and have fun with their systems more
eagerly if they know they can still recover quickly with ready tools
in case of a serious error.

-- 
Yar


Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Bruno Tavares
On 10/17/07, Fabio Checconi [EMAIL PROTECTED] wrote:
  From: Ulf Lilleengen [EMAIL PROTECTED]
  Date: Wed, Oct 17, 2007 03:09:35PM +0200
 
  On ons, okt 17, 2007 at 02:19:07 +0200, Fabio Checconi wrote:
   Maybe I've not enough experience/knowledge of the driver subsystem,
 [...]
  If you look at it, Hybrid is just a generalization of the existing
  bioq_* API already defined. And this API is used by GEOM classes _before_
  device drivers get the requests AFAIK.
 

 I looked at the Hybrid code, but I don't think that the bioq_*
 family of calls can be the right place to start, for the problems
 experienced during the Hybrid development with locking/anticipation
 and because you can have the same request passing through multiple
 bioqs during its path to the device (e.g., two stacked geoms using
 two different bioqs and then a device driver using bioq_* to organize
 its queue, or geoms using more than one bioq, like raid3; I think
 the complexity can become unmanageable.)  One could even think to
 configure each single bioq in the system, but things can get very
 complex in this way.


  For a simple example on a driver, the md-driver might be a good place to
  look. Note that I have little experience and knowledge of the driver
  subsystem myself.
 

 I'll take a look, thanks.


  Also note (from the Hybrid page):
  * we could not provide support for non work-conserving schedulers, due to a
 [...]
 
  This certainly argues for having this in the GEOM layer, but perhaps it's
   possible to change the assumptions made in some drivers? The locking issue
  should perhaps be better planned though, and an audit of the driver disksort
  code is necessary.
 

 I need some more time to think about that :)


  Also:
  * as said, the ATA driver in 6.x/7.x moves the disksort one layer below the
one we are working at, so this particular work won't help on ATA-based 6.x
machines.
We should figure out how to address this, because the work done at that
layer is mostly a replica of the bioq_*() API.
 
   So, I see this can get a bit messy given that the ATA driver does its
   disksort on its own, but perhaps it would be possible to fix this by
   changing the general ATA driver to use its own pluggable scheduler.
 
  Anyway, I shouldn't demand that you do this, especially since I don't have
   any code or anything to show, and because you decide what you want to do.

 I still cannot say if a GEOM scheduler is better than a scheduler
 put at a lower level, or if the bioq_* interface is better than any
 other alternative, so your suggestions are welcome.  Moreover I'd
 really like to discuss/work together, or at least do things with
  some agreement on them.  If I have the time to experiment with
 more than one solution I'll be happy to do that.


  However, I'd hate to see the Hybrid effort go to waste :) I was hoping some
  of the authors of the project would reply with their thoughts, so I CC'ed
  them.

 Well, the work done on Hybrid had also interesting aspects from
 the algorithm side... but that's another story...




-- 

 This .signature sanitized for your protection