Re: amrd disk performance drop after running under high load
Alexey Popov wrote: This is very unlikely, because I have 5 other video storage servers with the same hardware and software configuration, and they are fine.

Clearly something is different about them, though. If you can characterize exactly what that is, it will help.

I can't see any difference but the date of installation. I really compared all parameters and found nothing interesting. At first glance one could say the problem is in Dell's x850 series or amr(4), but we run this hardware on many other projects and it works well. Linux also works on it.

OK, but there is no evidence in what you posted so far that amr is involved in any way. There is convincing evidence that it is the mbuf issue.

Why are you sure this is the mbuf issue?

Because that is the only problem shown in the data you posted.

For example, if there is a real problem with amr or VM causing a disk slowdown, then when it occurs the network subsystem will have a different load pattern. Instead of just quickly sending large amounts of data, the system will have to accept a large number of simultaneous connections waiting for data. Can this cause high mbuf contention?

I'd expect to see evidence of the main problem.

And a few hours ago I received feedback from Andrzej Tobola; he has the same problem on FreeBSD 7 with a Promise ATA software mirror:

Well, he didn't provide any evidence yet that it is the same problem, so let's not become confused by feelings :)

I think he is talking about 100% disk busy while processing ~5 transfers/sec.

%busy as reported by gstat doesn't mean what you think it does. What is the I/O response time? That's the meaningful statistic for evaluating I/O load. Also, you didn't post about this.

So I can conclude that FreeBSD has a long-standing bug in VM that could be triggered when serving a large amount of static data (much bigger than memory size) at high rates. Possibly this only applies to large files like mp3 or video.

It is possible; we have further work to do to conclude this, though.
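Kris's distinction between %busy and response time comes down to simple arithmetic. The sketch below is mine (the function name and numbers are illustrative, not anything gstat reports directly): a disk that is 100% busy while completing only ~5 transfers/sec is spending about 200 ms on each transfer, while a disk that is 100% busy at 200 transfers/sec is merely saturated, at a healthy ~5 ms per transfer.

```python
def avg_service_time(busy_fraction, transfers_per_sec):
    """Rough average time the disk spends on each transfer (seconds).

    busy_fraction: 0.0-1.0, as derived from gstat's %busy column.
    transfers_per_sec: completed transfers over the same interval.
    """
    return busy_fraction / transfers_per_sec

# 100% busy at 5 tps: each transfer occupies ~200 ms of disk time -> pathological.
print(avg_service_time(1.0, 5))    # 0.2
# 100% busy at 200 tps: ~5 ms per transfer -> busy but behaving normally.
print(avg_service_time(1.0, 200))  # 0.005
```

The same %busy figure thus describes two very different situations, which is why the per-transfer response time, not the busy percentage, is the number worth posting.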
I forgot to mention that I have pmc and kgmon profiles for both the good and bad periods, but I don't have enough knowledge to interpret them properly and I'm not sure if they can help.

pmc would be useful.

Also, I now run nginx instead of lighttpd on one of the problematic servers. It seems to work much better: sometimes there are peaks in disk load, but the disk does not become very slow and network output does not change. The difference with nginx is that it runs multiple processes, while lighttpd by default has only one. I have now configured lighttpd on another server to run with multiple workers; I'll see if that helps. What else can I try?

Still waiting on the vmstat -z output.

Kris
___
freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: amrd disk performance drop after running under high load
Kris Kennaway wrote: What else can I try?

Still waiting on the vmstat -z output. Also, can you please obtain vmstat -i, netstat -m, and 10 seconds of representative vmstat -w output both when the problem is occurring and when it is not?

Kris
Re: Filesystem snapshots dog slow
On 2007-Oct-16 06:54:11 -0500, Eric Anderson [EMAIL PROTECTED] wrote: will give you a good understanding of what the issue is. Essentially, your disk is hammered making copies of all the cylinder groups, skipping those that are 'busy', and coming back to them later. On a 200GB disk, you could have 1000 cylinder groups, each having to be locked, copied, unlocked, and then checked again for any subsequent changes. The stalls you see are when there are lock contentions, or disk IO issues. On a single disk (like your setup above), your snapshots will take forever since there is very little random IO performance available to you.

That said, there is a fair amount of scope available for improving both the creation and deletion performance.

Firstly, it's not clear to me that having more than a few hundred CGs has any real benefit. There was a massive gain in moving from (effectively) a single CG in pre-FFS to a few dozen CGs in FFS as it was first introduced. Modern disks are roughly 5 orders of magnitude larger, and voice-coil actuators mean that seek times are almost independent of distance. CG sizes are currently limited by the requirement that the cylinder group (including the cylinder group maps) must fit into a single FS block. Removing this restriction would allow CGs to be much larger.

Secondly, all the I/O during both snapshot creation and deletion is in FS-block-size chunks. Increasing the I/O size would significantly increase the I/O performance. Whilst it doesn't make sense to read more than you need, there still appears to be plenty of scope to combine writes.

Between these two items, I would expect potential performance gains of at least 20:1. Note that I'm not suggesting that either of these items is trivial.

-- Peter Jeremy
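Peter's figures are easy to reproduce with back-of-the-envelope arithmetic. The sketch below is mine, with an assumed cylinder-group size of roughly 190 MB (a typical UFS2 figure with 16 KB blocks, not something stated in the thread); it also shows how sharply coalescing FS-block-sized writes into larger transfers cuts the I/O count.

```python
def cg_count(disk_bytes, cg_bytes=190 * 10**6):
    # Number of cylinder groups at the assumed per-group size.
    return disk_bytes // cg_bytes

def io_ops(bytes_to_copy, io_size):
    # I/O operations needed to copy a region at a given transfer size (ceiling).
    return -(-bytes_to_copy // io_size)

disk = 200 * 10**9                       # the 200 GB disk from the thread
print(cg_count(disk))                    # 1052 -> matches the "1000 CGs" estimate

# Copying, say, 16 MB of cylinder-group metadata during snapshot creation:
print(io_ops(16 * 2**20, 16 * 2**10))    # 1024 ops at one 16 KB FS block each
print(io_ops(16 * 2**20, 1 * 2**20))     # 16 ops if writes are combined into 1 MB transfers
```

On a single spindle each of those operations pays a seek/rotation cost, so shrinking the operation count (fewer, larger CGs; larger combined writes) is where the 20:1 headroom comes from.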
Re: Interrupt/speed problems with 6.2 NFS server
Doug Clements wrote: Hi, I have a new NFS server processing roughly 15 Mbit of NFS traffic, which we recently upgraded from an older 4.10 box. It has a 3ware RAID card and serves NFS out of a single em NIC to LAN clients.

The machine works great just serving NFS, but when I try to copy data from one RAID volume to another for backups, the machine's NFS performance goes way down, and NFS ops start taking multiple seconds to complete. The file copy itself goes quite quickly, as would be expected. The console of the machine also starts to lag pretty badly, and I get the 'typing through mud' effect. I use rdist6 to do the backup.

My first impression was that I was having interrupt issues, since during the backup the em interfaces were pushing over 200k interrupts/sec (roughly 60% CPU spent processing interrupts). So I recompiled the kernel with polling enabled and enabled it on the NICs. The strange thing is that polling shows as enabled in ifconfig, but systat -vm still shows the same number of interrupts, and I get the same performance with polling enabled.

I'm looking for some guidance on why the machine bogs down so much during something that seems like it should barely impact performance at all, and also on why polling didn't seem to lower the number of interrupts processed. The old machine was 6 years old running an old Intel RAID5, and it handled NFS and the concurrent file copies without breaking a sweat.

My 3ware is set up as follows:
a 2-disk mirror, for the system
a 4-disk RAID10, for /mnt/data1
a 4-disk RAID10, for /mnt/data2

Copyright (c) 1992-2007 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.2-RELEASE-p8 #0: Thu Oct 11 10:43:22 PDT 2007
[EMAIL PROTECTED]:/usr/obj/usr/src/sys/MADONNA
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: Genuine Intel(R) CPU @ 2.66GHz (2670.65-MHz K8-class CPU)
  Origin = GenuineIntel  Id = 0x6f4  Stepping = 4
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4e3bd<SSE3,RSVD2,MON,DS_CPL,VMX,EST,TM2,b9,CX16,b14,b15,b18>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  Cores per package: 2
real memory  = 4831838208 (4608 MB)
avail memory = 4125257728 (3934 MB)
ACPI APIC Table: <INTEL S5000PSL>
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ioapic1 <Version 2.0> irqs 24-47 on motherboard
lapic0: Forcing LINT1 to edge trigger
kbd1 at kbdmux0
ath_hal: 0.9.17.2 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
acpi0: <INTEL S5000PSL> on motherboard
acpi0: Power Button (fixed)
Timecounter ACPI-fast frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
cpu0: <ACPI CPU> on acpi0
acpi_throttle0: <ACPI CPU Throttling> on cpu0
acpi_button0: <Sleep Button> on acpi0
acpi_button1: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xca2,0xca3,0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> at device 2.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 0.0 on pci1
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> irq 16 at device 0.0 on pci2
pci3: <ACPI PCI bus> on pcib3
pcib4: <ACPI PCI-PCI bridge> irq 17 at device 1.0 on pci2
pci4: <ACPI PCI bus> on pcib4
pcib5: <ACPI PCI-PCI bridge> irq 18 at device 2.0 on pci2
pci5: <ACPI PCI bus> on pcib5
em0: <Intel(R) PRO/1000 Network Connection Version - 6.2.9> port 0x3020-0x303f mem 0xf882-0xf883,0xf840-0xf87f irq 18 at device 0.0 on pci5
em0: Ethernet address: 00:15:17:21:bf:30
em1: <Intel(R) PRO/1000 Network Connection Version - 6.2.9> port 0x3000-0x301f mem 0xf880-0xf881,0xf800-0xf83f irq 19 at device 0.1 on pci5
em1: Ethernet address: 00:15:17:21:bf:31
pcib6: <ACPI PCI-PCI bridge> at device 0.3 on pci1
pci6: <ACPI PCI bus> on pcib6
3ware device driver for 9000 series storage controllers, version: 3.60.02.012
twa0: <3ware 9000 series Storage Controller> port 0x2000-0x203f mem 0xfa00-0xfbff,0xf890-0xf8900fff irq 26 at device 2.0 on pci6
twa0: [GIANT-LOCKED]
twa0: INFO: (0x15: 0x1300): Controller details:: Model 9550SX-12, 12 ports, Firmware FE9X 3.08.00.004, BIOS BE9X 3.08.00.002
pcib7: <PCI-PCI bridge> at device 3.0 on pci0
pci7: <PCI bus> on pcib7
pcib8: <ACPI PCI-PCI bridge> at device 4.0 on pci0
pci8: <ACPI PCI bus> on pcib8
pcib9: <ACPI PCI-PCI bridge> at device 5.0 on pci0
pci9: <ACPI PCI bus> on pcib9
pcib10: <ACPI PCI-PCI bridge> at device 6.0 on pci0
pci10: <ACPI PCI bus> on pcib10
pcib11: <PCI-PCI bridge> at device 7.0 on pci0
pci11: <PCI bus> on pcib11
pci0: <base peripheral> at device 8.0 (no driver attached)
pcib12: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci12: <ACPI PCI bus> on pcib12
uhci0: <UHCI (generic) USB controller> port 0x4080-0x409f irq 23 at device 29.0 on pci0
uhci0:
Re: Filesystem snapshots dog slow
Kostik Belousov wrote: On Wed, Oct 17, 2007 at 08:00:03PM +1000, Peter Jeremy wrote: On 2007-Oct-16 06:54:11 -0500, Eric Anderson [EMAIL PROTECTED] wrote: will give you a good understanding of what the issue is. Essentially, your disk is hammered making copies of all the cylinder groups, skipping those that are 'busy', and coming back to them later. On a 200GB disk, you could have 1000 cylinder groups, each having to be locked, copied, unlocked, and then checked again for any subsequent changes. The stalls you see are when there are lock contentions, or disk IO issues. On a single disk (like your setup above), your snapshots will take forever since there is very little random IO performance available to you.

That said, there is a fair amount of scope available for improving both the creation and deletion performance. Firstly, it's not clear to me that having more than a few hundred CGs has any real benefit. There was a massive gain in moving from (effectively) a single CG in pre-FFS to a few dozen CGs in FFS as it was first introduced. Modern disks are roughly 5 orders of magnitude larger, and voice-coil actuators mean that seek times are almost independent of distance. CG sizes are currently limited by the requirement that the cylinder group (including the cylinder group maps) must fit into a single FS block. Removing this restriction would allow CGs to be much larger.

Secondly, all the I/O during both snapshot creation and deletion is in FS-block-size chunks. Increasing the I/O size would significantly increase the I/O performance. Whilst it doesn't make sense to read more than you need, there still appears to be plenty of scope to combine writes. Between these two items, I would expect potential performance gains of at least 20:1. Note that I'm not suggesting that either of these items is trivial.

This is, unfortunately, quite true. Allowing non-atomic updates of the cg block means a lot of complications in the softupdate code, IMHO.

I agree with all the above.
I think it has not been done because of exactly what Kostik says. I really think that the CG max size is *way* too small now, and should be about 10-50 times larger, but performance tests would need to be run. Eric
Re: amrd disk performance drop after running under high load
Hi.

Kris Kennaway wrote: After some time running under high load, disk performance becomes extremely poor. During those periods 'systat -vm 1' shows something like this:

This web service is similar to YouTube. This server is a video store. I have around 200 GB of *.flv (flash video) files on the server. I run lighttpd as the web server. Disk load is usually around 50%, network output 100 Mbit/s, 100 simultaneous connections. The CPU is mostly idle.

This is very unlikely, because I have 5 other video storage servers with the same hardware and software configuration, and they are fine.

Clearly something is different about them, though. If you can characterize exactly what that is, it will help.

I can't see any difference but the date of installation. I really compared all parameters and found nothing interesting. At first glance one could say the problem is in Dell's x850 series or amr(4), but we run this hardware on many other projects and it works well. Linux also works on it.

OK, but there is no evidence in what you posted so far that amr is involved in any way. There is convincing evidence that it is the mbuf issue.

Why are you sure this is the mbuf issue? For example, if there is a real problem with amr or VM causing a disk slowdown, then when it occurs the network subsystem will have a different load pattern. Instead of just quickly sending large amounts of data, the system will have to accept a large number of simultaneous connections waiting for data. Can this cause high mbuf contention?

And a few hours ago I received feedback from Andrzej Tobola; he has the same problem on FreeBSD 7 with a Promise ATA software mirror:

Well, he didn't provide any evidence yet that it is the same problem, so let's not become confused by feelings :)

I think he is talking about 100% disk busy while processing ~5 transfers/sec.

So I can conclude that FreeBSD has a long-standing bug in VM that could be triggered when serving a large amount of static data (much bigger than memory size) at high rates.
Possibly this only applies to large files like mp3 or video.

It is possible; we have further work to do to conclude this, though.

I forgot to mention that I have pmc and kgmon profiles for both the good and bad periods, but I don't have enough knowledge to interpret them properly and I'm not sure if they can help.

Also, I now run nginx instead of lighttpd on one of the problematic servers. It seems to work much better: sometimes there are peaks in disk load, but the disk does not become very slow and network output does not change. The difference with nginx is that it runs multiple processes, while lighttpd by default has only one. I have now configured lighttpd on another server to run with multiple workers; I'll see if that helps. What else can I try?

With best regards, Alexey Popov
Re: Pluggable Disk Scheduler Project
On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote: Hi, is anybody working on the `Pluggable Disk Scheduler Project' from the ideas page? I've been kicking the idea around in my head, but I'm probably newer to everything involved than you are, so feel free to pick it up. If you want, we can toss some ideas and code to each other, though I don't really have anything on the latter. [...]

After reading [1], [2] and its follow-ups, the main problems that need to be addressed seem to be:

o Is working on disk scheduling worthwhile at all? Probably; one of the main applications would be to make the background fsck a little more well-behaved. I agree; as I said before, the ability to assign I/O priorities is probably one of the most important things.

o Where is the right place (in GEOM) for a disk scheduler? [...]

o How can anticipation be introduced into the GEOM framework? I wouldn't focus on just anticipation, but also on other types of schedulers (I/O scheduling influenced by nice value?).

o What can be an interface for disk schedulers? Good question, but GEOM seems a good start ;)

o How to deal with devices that handle multiple requests at a time? Bad news first: this is most disks out there, in a way ;) SCSI has tagged queuing, ATA has native command queuing or whatever the ata people came up with over their morning coffee today. I'll mention a bit more about this further down.

o How to deal with metadata requests and other VFS issues? Like any other disk request, though for priority-respecting schedulers this may get rather interesting. [...]

The main idea is to allow the scheduler to enqueue the requests, having only one (other small fixed numbers can be better on some hardware) outstanding request, and to pass new requests to its provider only after the service of the previous one has ended. [...]
- servers where anticipatory scheduling performs better than the elevator
- realtime environments that need a scheduler fitting their needs
- the background fsck, if someone implements a priority scheduler

Apache is actually a good candidate according to the old anticipatory scheduling design document (not sure of its relevance today, but...).

Over to a more general view of its architecture: when I looked at this project for the first time, I was under the impression that this would be best done in a GEOM class. However, I think the approach taken in the Hybrid project is better, for the following reasons:

- It makes it possible to use by _both_ GEOM classes and device drivers (which might use some other scheduler type?).
- It does not remove any configurability, since changing schedulers etc. can be done by the user with sysctl.
- It could make it possible for a GEOM class to decide for itself which scheduler it wants to use (most GEOM classes use the standard bioq_disksort interface in disk_subr.c).
- The ability to stack a GEOM class with a scheduler could easily be emulated by creating a GEOM class that utilizes the disksched framework.

All in all, I just think this approach gives more flexibility than putting it in a GEOM class that has to be added manually by a user. Just my thoughts on this.

Also, I got my test-box up again today, and will be trying your patch as soon as I've upgraded it to CURRENT, Fabio.

-- Ulf Lilleengen
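To make the "pluggable" idea concrete, here is a minimal sketch of what interchangeable scheduling policies behind one interface look like. This is purely illustrative (Python; the class and method names are mine, not the Hybrid or bioq_* API): a GEOM class or driver would hold a reference to whichever policy object was selected, and only call enqueue/dispatch on it.

```python
class FCFSScheduler:
    """First-come-first-served: dispatch strictly in arrival order."""
    def __init__(self):
        self.queue = []
    def enqueue(self, bio):          # bio: an (offset, payload) request tuple
        self.queue.append(bio)
    def dispatch(self):
        return self.queue.pop(0) if self.queue else None

class ElevatorScheduler:
    """Disksort-style: keep the queue sorted by offset and sweep upward."""
    def __init__(self):
        self.queue = []
        self.head = 0                # offset of the last dispatched request
    def enqueue(self, bio):
        self.queue.append(bio)
        self.queue.sort(key=lambda b: b[0])
    def dispatch(self):
        if not self.queue:
            return None
        # Serve the first request at or beyond the head; wrap around if none.
        for i, b in enumerate(self.queue):
            if b[0] >= self.head:
                self.head = b[0]
                return self.queue.pop(i)
        self.head = self.queue[0][0]
        return self.queue.pop(0)

def run(sched, offsets):
    """Feed requests in, drain them, and return the dispatch order."""
    for off in offsets:
        sched.enqueue((off, None))
    order = []
    while (b := sched.dispatch()) is not None:
        order.append(b[0])
    return order

print(run(FCFSScheduler(), [70, 10, 40]))      # [70, 10, 40]
print(run(ElevatorScheduler(), [70, 10, 40]))  # [10, 40, 70]
```

Swapping the policy object changes only the dispatch order, not the caller; that is the flexibility argument for exposing the scheduler behind a common interface (whether that interface lives in a GEOM class or behind the bioq_* calls).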
Re: Pluggable Disk Scheduler Project
From: Karsten Behrmann [EMAIL PROTECTED] Date: Tue, Oct 16, 2007 04:10:37PM +0200

Hi, is anybody working on the `Pluggable Disk Scheduler Project' from the ideas page? I've been kicking the idea around in my head, but I'm probably newer to everything involved than you are, so feel free to pick it up. If you want, we can toss some ideas and code to each other, though I don't really have anything on the latter.

Thank you for your answer. I'd really like to work/discuss with you and anyone else interested in this project.

o Where is the right place (in GEOM) for a disk scheduler? I have spent some time at EuroBSDCon talking to Kirk and phk about this, and the result was that I now know strong proponents both for putting it into the disk drivers and for putting it into GEOM ;-) Personally, I would put it into GEOM. I'll go into more detail on this later, but basically, GEOM seems a better fit for high-level code than a device driver, and if done properly the performance penalties should be negligible.

I'm pretty interested even in the arguments for the opposite solution; I started from GEOM because it seemed to be a) what was proposed/requested on the ideas page, and b) cleaner, at least for a prototype. I wanted to start with some code also to evaluate the performance penalties of that approach.

I am a little bit scared by the prospect of changing the queueing mechanisms that drivers use, as this kind of modification can be difficult to write, test and maintain, but I'd really like to know what people with experience in those kernel areas think about the possibility of doing more complex I/O scheduling, with some sort of unified interface, at this level.

As a side note, I have not posted any performance numbers so far because at the moment I only have access to old ATA drives that would not give significant results.

o How can anticipation be introduced into the GEOM framework?
I wouldn't focus on just anticipation, but also on other types of schedulers (I/O scheduling influenced by nice value?).

That would be interesting, especially for the background fsck case. I think that some kind of fair-sharing approach should be used; as you say below, a priority-driven scheduler can have relations with the VFS that are difficult to track. (This problem was pointed out also in one of the follow-ups to [1].)

o How to deal with metadata requests and other VFS issues? Like any other disk request, though for priority-respecting schedulers this may get rather interesting. [...]

The main idea is to allow the scheduler to enqueue the requests, having only one (other small fixed numbers can be better on some hardware) outstanding request, and to pass new requests to its provider only after the service of the previous one has ended.

You'll want to queue at least two requests at once. The reason for this is performance: currently, drivers queue their own I/O. This means that as soon as a request completes (on devices that don't have in-device queues), they can fairly quickly grab a new request from their internal queue and push it to the device from the interrupt handler or some other fast path.

Wouldn't that require, to be sustainable (unless you want a fast dispatch only every other request), that the driver queue always hold two or more requests? That way, to refill the driver queue, you ask the upper scheduler to dispatch a new request every time a request is taken from the driver queue and sent to the disk, or at any other moment before the request under service completes. But then you cannot have an anticipation mechanism, because the next request you'll want to dispatch from the upper scheduler has not yet been issued (it will be issued only after the one being served completes and the userspace application resumes).

Having the device idle while the response percolates up the GEOM stack and a new request comes down will likely be rather wasteful.
I completely agree on that. I've only done it this way because it was the least intrusive option I could find. What could more efficient alternatives be? (Obviously without subverting any of the existing interfaces, and still allowing the anticipation of requests.)

For disks with queuing, I'd recommend trying to keep the queue reasonably full (unless the queuing strategy says otherwise); for disks without queuing I'd say we want to push at least one more request down. Personally, I think the sanest design would be to have device drivers return a temporary I/O error along the lines of EAGAIN, meaning their queue is full.

This can be easily done for asynchronous requests. The trouble arises when dealing with synchronous requests: if you dispatch more than one synchronous request you are serving more than one process, and you get long seek times between the requests you have dispatched. I think we should do (I'll do as soon as I have access to some realistic test system) some
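The cost of the single-outstanding-request design discussed above can be put into rough numbers with a toy model. This sketch is mine, and the figures (5 ms service time, 1 ms stack round-trip per dispatch) are assumptions for illustration, not measurements:

```python
def total_time(n_requests, service_us, dispatch_us, queue_depth):
    """Toy model of serving n identical requests; all times in microseconds.

    With queue_depth == 1 the device sits idle for dispatch_us after every
    completion, while the completion percolates up the stack and the next
    request is pushed back down. With queue_depth >= 2 the driver always
    has the next request ready, so that gap is hidden.
    """
    gap = dispatch_us if queue_depth < 2 else 0
    return n_requests * service_us + (n_requests - 1) * gap

# 100 requests of 5 ms each, with an assumed 1 ms round-trip per dispatch:
print(total_time(100, 5000, 1000, 1))  # 599000 us: the disk idles ~17% of the time
print(total_time(100, 5000, 1000, 2))  # 500000 us: the round-trip is fully overlapped
```

This is why Karsten wants at least two requests queued: the second one hides the stack round-trip. An anticipatory scheduler, by contrast, must deliberately accept some of this idleness in exchange for fewer seeks, which is exactly the tension Fabio describes.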
Re: Pluggable Disk Scheduler Project
From: Ulf Lilleengen [EMAIL PROTECTED] Date: Wed, Oct 17, 2007 01:07:15PM +0200

On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote: Over to a more general view of its architecture: when I looked at this project for the first time, I was under the impression that this would be best done in a GEOM class. However, I think the approach that was taken in the Hybrid project is better

Ok. I think that such a solution requires a lot more effort on the design and coding sides, as it requires modifying the drivers and can bring us problems with locking and with the queueing assumptions that may vary on a per-driver basis. Maybe I don't have enough experience/knowledge of the driver subsystem, but I would not remove the queueing that is done now by the drivers (think of ata freezepoints); instead I'd like to try to grab the requests before they get to the driver (e.g., in/before their d_strategy call) and have some sort of pull mechanism when requests complete (I still don't have any (serious) idea on that; I fear that the right place to do it, for locking issues and so on, can be driver dependent). Any ideas on that? Which drivers would be good starting points for trying to write down some code?

Also, I got my test-box up again today, and will be trying your patch as soon as I've upgraded it to CURRENT, Fabio.

Thank you very much! Please consider that my primary concern with the patch was its interface; the algorithm is just an example (it should give an idea of the performance loss due to the mechanism overhead with async requests, and some improvement on greedy sync loads).
Re: Filesystem snapshots dog slow
On Wed, Oct 17, 2007 at 08:00:03PM +1000, Peter Jeremy wrote: On 2007-Oct-16 06:54:11 -0500, Eric Anderson [EMAIL PROTECTED] wrote: will give you a good understanding of what the issue is. Essentially, your disk is hammered making copies of all the cylinder groups, skipping those that are 'busy', and coming back to them later. On a 200GB disk, you could have 1000 cylinder groups, each having to be locked, copied, unlocked, and then checked again for any subsequent changes. The stalls you see are when there are lock contentions, or disk IO issues. On a single disk (like your setup above), your snapshots will take forever since there is very little random IO performance available to you.

That said, there is a fair amount of scope available for improving both the creation and deletion performance. Firstly, it's not clear to me that having more than a few hundred CGs has any real benefit. There was a massive gain in moving from (effectively) a single CG in pre-FFS to a few dozen CGs in FFS as it was first introduced. Modern disks are roughly 5 orders of magnitude larger, and voice-coil actuators mean that seek times are almost independent of distance. CG sizes are currently limited by the requirement that the cylinder group (including the cylinder group maps) must fit into a single FS block. Removing this restriction would allow CGs to be much larger.

Secondly, all the I/O during both snapshot creation and deletion is in FS-block-size chunks. Increasing the I/O size would significantly increase the I/O performance. Whilst it doesn't make sense to read more than you need, there still appears to be plenty of scope to combine writes. Between these two items, I would expect potential performance gains of at least 20:1. Note that I'm not suggesting that either of these items is trivial.

This is, unfortunately, quite true. Allowing non-atomic updates of the cg block means a lot of complications in the softupdate code, IMHO.
Re: Pluggable Disk Scheduler Project
On ons, okt 17, 2007 at 02:19:07 +0200, Fabio Checconi wrote: From: Ulf Lilleengen [EMAIL PROTECTED] Date: Wed, Oct 17, 2007 01:07:15PM +0200

On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote: Over to a more general view of its architecture: when I looked at this project for the first time, I was under the impression that this would be best done in a GEOM class. However, I think the approach that was taken in the Hybrid project is better

Ok. I think that such a solution requires a lot more effort on the design and coding sides, as it requires the modification of the drivers and can bring us problems with locking and with the queueing assumptions that may vary on a per-driver basis.

I completely agree about the issue of converting device drivers, but at least it will be an _optional_ possibility (having different scheduler plugins could make this possible). One does not necessarily need to convert the drivers.

Maybe I don't have enough experience/knowledge of the driver subsystem, but I would not remove the queueing that is done now by the drivers (think of ata freezepoints); instead I'd like to try to grab the requests before they get to the driver (e.g., in/before their d_strategy call) and have some sort of pull mechanism when requests complete (I still don't have any (serious) idea on that; I fear that the right place to do it, for locking issues and so on, can be driver dependent). Any ideas on that? Which drivers would be good starting points for trying to write down some code?

If you look at it, Hybrid is just a generalization of the existing bioq_* API already defined. And this API is used by GEOM classes _before_ device drivers get the requests, AFAIK. For a simple example of a driver, the md driver might be a good place to look. Note that I have little experience and knowledge of the driver subsystem myself.

Also note (from the Hybrid page): * we could not provide support for non-work-conserving schedulers, due to a couple of reasons: 1.
the assumption, in some drivers, that bioq_disksort() will make requests immediately available (so a subsequent bioq_first() will not return NULL); 2. the fact that there is no bioq_lock()/bioq_unlock(), so the scheduler does not have a safe way to generate requests for a given queue.

This certainly argues for having this in the GEOM layer, but perhaps it's possible to change the assumptions made in some drivers? The locking issue should perhaps be better planned, though, and an audit of the driver disksort code is necessary.

Also: * as said, the ATA driver in 6.x/7.x moves the disksort one layer below the one we are working at, so this particular work won't help on ATA-based 6.x machines. We should figure out how to address this, because the work done at that layer is mostly a replica of the bioq_*() API.

So, I can see this getting a bit messy given that the ATA driver does its own disksort, but perhaps it would be possible to fix this by changing the general ATA driver to use its own pluggable scheduler. Anyway, I shouldn't demand that you do this, especially since I don't have any code or anything to show, and because you decide what you want to do. However, I'd hate to see the Hybrid effort go to waste :) I was hoping some of the authors of the project would reply with their thoughts, so I CC'ed them.

-- Ulf Lilleengen
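The first assumption quoted above — that a dequeue immediately after an insert can never come back empty — is exactly what a non-work-conserving (anticipatory) scheduler violates. A small illustrative sketch (Python; the class and method names are mine, not the bioq_* API): the queue deliberately answers "nothing to dispatch" while it holds a request, because it is waiting briefly for another nearby request from the process it just served.

```python
class AnticipatoryQueue:
    """Non-work-conserving queue: after serving a process, hold back other
    processes' requests for a short window, hoping the served process
    issues another nearby request."""
    def __init__(self, wait_window=2):
        self.pending = []            # list of (pid, offset) requests
        self.last_pid = None         # process we are anticipating
        self.deadline = 0            # tick at which the window closes
        self.wait_window = wait_window

    def insert(self, pid, offset):
        self.pending.append((pid, offset))

    def first(self, now_tick):
        if not self.pending:
            return None
        # Prefer a request from the process we are anticipating.
        for i, (pid, off) in enumerate(self.pending):
            if pid == self.last_pid:
                return self._dispatch(i, now_tick)
        if self.last_pid is not None and now_tick < self.deadline:
            return None              # deliberately idle: still anticipating
        return self._dispatch(0, now_tick)

    def _dispatch(self, i, now_tick):
        pid, off = self.pending.pop(i)
        self.last_pid = pid
        self.deadline = now_tick + self.wait_window
        return (pid, off)

q = AnticipatoryQueue()
q.insert(1, 100)
print(q.first(0))   # (1, 100) dispatched; window opens for pid 1
q.insert(2, 9000)
print(q.first(1))   # None: queue is non-empty, yet nothing is dispatched
q.insert(1, 108)
print(q.first(1))   # (1, 108): the anticipated nearby request wins
print(q.first(5))   # (2, 9000): window expired, the other process is served
```

A driver written against the work-conserving contract would treat that None as "queue empty" and go idle with no guarantee of being kicked again, which is why either the driver assumptions or the queue API (bioq_lock()/bioq_unlock(), some completion-driven wakeup) would have to change to support schedulers like this.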
Re: Inner workings of turnstiles and sleepqueues
On Tuesday 16 October 2007 05:41:18 am Ed Schouten wrote: Hello, I asked the following question on questions@, but as requested, I'll forward it to this list because of its technical nature.

- Forwarded message from Ed Schouten [EMAIL PROTECTED] - Date: Mon, 15 Oct 2007 23:13:01 +0200 From: Ed Schouten [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: Inner workings of turnstiles and sleepqueues

Hello, I want to understand how the queueing of blocked threads in the kernel works when they are waiting for a lock, which, if I understand correctly, is done by the turnstiles and sleepqueues. I'm the proud owner of The Design and Implementation of the FreeBSD Operating System book, but for some reason I can't find anything about them in the book. Is there a way to obtain information about how they work? I already read the source somewhat, but that isn't an ideal solution, in my opinion.

The best option right now is to read the code. There are some comments in both the headers and the implementation.

-- John Baldwin
Re: Pluggable Disk Scheduler Project
On Wed, Oct 17, 2007 at 03:09:35PM +0200, Ulf Lilleengen wrote: ... discussion on Hybrid vs. GEOM as a suitable location for ... pluggable disk schedulers However, I'd hate to see the Hybrid effort go to waste :) I was hoping some of the authors of the project would reply with their thoughts, so I CC'ed them. We are in good contact with Fabio and I am monitoring the discussion, don't worry. cheers luigi
Re: Pluggable Disk Scheduler Project
From: Ulf Lilleengen [EMAIL PROTECTED] Date: Wed, Oct 17, 2007 03:09:35PM +0200 On Wed, Oct 17, 2007 at 02:19:07 +0200, Fabio Checconi wrote: Maybe I've not enough experience/knowledge of the driver subsystem, [...] If you look at it, Hybrid is just a generalization of the existing bioq_* API already defined. And this API is used by GEOM classes _before_ device drivers get the requests, AFAIK. I looked at the Hybrid code, but I don't think that the bioq_* family of calls can be the right place to start, because of the problems experienced during the Hybrid development with locking/anticipation and because you can have the same request passing through multiple bioqs on its path to the device (e.g., two stacked geoms using two different bioqs and then a device driver using bioq_* to organize its queue, or geoms using more than one bioq, like raid3; I think the complexity can become unmanageable.) One could even think of configuring each single bioq in the system, but things can get very complex that way. For a simple example on a driver, the md driver might be a good place to look. Note that I have little experience and knowledge of the driver subsystem myself. I'll take a look, thanks. Also note (from the Hybrid page): * we could not provide support for non-work-conserving schedulers, due to a [...] This certainly argues for having this in the GEOM layer, but perhaps it's possible to change the assumptions made in some drivers? The locking issue should perhaps be better planned, though, and an audit of the driver disksort code is necessary. I need some more time to think about that :) Also: * as said, the ATA driver in 6.x/7.x moves the disksort one layer below the one we are working at, so this particular work won't help on ATA-based 6.x machines. We should figure out how to address this, because the work done at that layer is mostly a replica of the bioq_*() API. 
So, I see this can get a bit messy given that the ATA driver does disksort on its own, but perhaps it would be possible to fix this by changing the general ATA driver to use its own pluggable scheduler. Anyway, I shouldn't demand that you do this, especially since I don't have any code or anything to show, and because you decide what you want to do. I still cannot say if a GEOM scheduler is better than a scheduler put at a lower level, or if the bioq_* interface is better than any other alternative, so your suggestions are welcome. Moreover, I'd really like to discuss/work together, or at least do things with some agreement on them. If I have the time to experiment with more than one solution I'll be happy to do that. However, I'd hate to see the Hybrid effort go to waste :) I was hoping some of the authors of the project would reply with their thoughts, so I CC'ed them. Well, the work done on Hybrid also had interesting aspects on the algorithm side... but that's another story...
Re: Interrupt/speed problems with 6.2 NFS server
Can you also send the output of ps -auxl? Also - do you notice this performance drop when running something like one of the network performance tools? I'd like to isolate the disk activity from the network activity for a clean test.. I tested this with iperf, and while I did see some NFS performance degradation, it did not bog down the machine or delay the terminal in the same way. NFS requests were still processed in an acceptable fashion. It was still responsive to commands and ran processes just fine. I would have to say it performed as expected during the iperf tests (on which I got about 85 Mbit/sec, which is also as expected). Interrupts went down to about 2000/sec on em, but the machine did not hang. Something I've noticed is that running 'systat -vm' seems to be part of the problem. If I run the file copy by itself with rdist, it's fast and runs OK. If I run it with systat -vm going, this is when the interrupts jump way up and the machine starts to delay badly. Noticing this, I tried running 'sysctl -a' during the file copy, thinking there was some problem with polling the kernel for certain statistics. Sure enough, sysctl -a stalls at two spots: once right after kern.random.sys.harvest.swi: 0 and again after debug.hashstat.nchash: 131072 61777 6 4713. While it is delayed there (for a couple of seconds each time) the machine is totally hung. Maybe this is a statistics-polling issue? Maybe the machine is delayed just long enough in systat -vm to make the NFS clients retry, causing a storm of interrupts? Other systat modes do not seem to cause the same problem (pigs, icmp, ifstat). I do not think the ps or systat output is very accurate, since I can't get them to run while the machine is hung up. I type in the command, but it does not run until the machine springs back to life. I'm not sure how this affects the measurements. 
http://toric.loungenet.org/~doug/sysctl-a http://toric.loungenet.org/~doug/psauxl http://toric.loungenet.org/~doug/systat-vm My real confusion lies in why there are still em interrupts at all, with polling on. Thanks! --Doug
Re: amrd disk performance drop after running under high load
Hi Kris Kennaway wrote: And a few hours ago I received feedback from Andrzej Tobola; he has the same problem on FreeBSD 7 with a Promise ATA software mirror: Well, he didn't provide any evidence yet that it is the same problem, so let's not become confused by feelings :) I think he is describing 100% disk busy while processing ~5 transfers/sec. % busy as reported by gstat doesn't mean what you think it does. What is the I/O response time? That's the meaningful statistic for evaluating I/O load. Also you didn't post about this. At the problematic time the disk felt very slow, all processes were in disk-wait state, and vmstat confirmed it with the % numbers. So I can conclude that FreeBSD has a long-standing bug in the VM that can be triggered when serving a large amount of static data (much bigger than memory size) at high rates. Possibly this only applies to large files like mp3 or video. It is possible; we have further work to do before we can conclude this, though. I forgot to mention I have pmc and kgmon profiles for the good and bad times, but I don't have enough knowledge to interpret them correctly and am not sure whether they can help. pmc would be useful. Unfortunately I've lost the pmc profiling results. I'll try to collect them again later. See vmstats in the attachment (vmstat -z; netstat -m; vmstat -i; vmstat -w 1 | head -11). 
Also you can see kgmon profiling results at: http://83.167.98.162/gprof/ With best regards, Alexey Popov

ITEM                  SIZE    LIMIT    USED    FREE    REQUESTS  FAILURES
UMA Kegs:              240,       0,     71,      4,         71,        0
UMA Zones:             376,       0,     71,      9,         71,        0
UMA Slabs:             128,       0,   1011,     62,     243081,        0
UMA RCntSlabs:         128,       0,    361,   1205,     363320,        0
UMA Hash:              256,       0,      4,     11,          7,        0
16 Bucket:             152,       0,     45,     30,         72,        0
32 Bucket:             280,       0,     25,     45,         69,        0
64 Bucket:             536,       0,     17,     25,         55,       53
128 Bucket:           1048,       0,    287,     88,       1200,    95423
VM OBJECT:             224,       0,   5536,  23228,    7675004,        0
MAP:                   352,       0,      7,     15,          7,        0
KMAP ENTRY:            112,   90222,    283,   1037,    1207524,        0
MAP ENTRY:             112,       0,   1396,    419,   72221561,        0
PV ENTRY:               48, 2244600,  17835,  30261,  768591673,        0
DP fakepg:             120,       0,      0,     31,         10,        0
mt_zone:              1024,       0,    170,      6,        170,        0
16:                     16,       0,   3578,   2470,  745206870,        0
32:                     32,       0,   1273,    343,    1750850,        0
64:                     64,       0,   6147,   1693,  487691440,        0
128:                   128,       0,   4659,    387,    1464251,        0
256:                   256,       0,    596,   2539,    7208469,        0
512:                   512,       0,    608,    253,     791295,        0
1024:                 1024,       0,     49,    239,      82867,        0
2048:                 2048,       0,     27,    295,     115362,        0
4096:                 4096,       0,    240,    278,     564659,        0
Files:                 120,       0,    544,    324,  263880246,        0
TURNSTILE:             104,       0,    181,     83,        307,        0
PROC:                  856,       0,     82,     82,     308409,        0
THREAD:                608,       0,    169,     11,      24468,        0
KSEGRP:                136,       0,    165,     69,        165,        0
UPCALL:                 88,       0,      3,     73,          3,        0
SLEEPQUEUE:             64,       0,    181,     99,        307,        0
VMSPACE:               544,       0,     35,     77,     310929,        0
mbuf_packet:           256,       0,    368,    115, 1331807039,        0
mbuf:                  256,       0,   2016,   2331, 5433003167,        0
mbuf_cluster:         2048,   32768,    483,    239, 1236143964,        0
mbuf_jumbo_pagesize:  4096,       0,      0,      0,          0,        0
mbuf_jumbo_9k:        9216,       0,      0,      0,          0,        0
mbuf_jumbo_16k:      16384,       0,      0,      0,          0,        0
ACL UMA zone:          388,       0,      0,      0,          0,        0
g_bio:                 216,       0,      4,    410,   48175991,        0
ata_request:           336,       0,      0,     22,         24,        0
ata_composite:         376,       0,      0,      0,          0,        0
VNODE:
Re: Useful tools missing from /rescue
On Mon, Oct 15, 2007 at 10:38:26AM -0700, David O'Brien wrote: On Sat, Oct 13, 2007 at 10:01:39AM +0400, Yar Tikhiy wrote: On Wed, Oct 03, 2007 at 07:23:44PM -0700, David O'Brien wrote: I also don't see the need for pgrep - I think needing that says your system is running multiuser pretty well. First of all, I'd like to point out that /rescue doesn't need to be as minimal as /stand used to be. Now, /rescue is a compact yet versatile set of essential tools that can help in any difficult situation when /*bin:/usr/*bin are unusable for some reason, not only in restoring a broken system while in single-user mode. As for pgrep+pkill, it can come in handy if one has screwed up his live system and wants to recover it without dropping the system to single-user. But if we take this just a little bit farther, then why don't we go back to a static /[s]bin except for the few things one might need LDAP, etc., for? That is, what's the purpose in continuing to duplicate /[s]bin into /rescue? /rescue should be just enough to reasonably get a system whose shared libs are messed up working again. Note that /rescue includes the most essential tools from /usr/[s]bin, too. Irrespective of its initial purpose, I regard /rescue as an emergency toolset left aside. In particular, it's good to know it's there when you experiment with a live remote system. A valid objection to this point is that pgrep's job can be done with a combination of ps(1) and sed(1), so it's just a matter of convenience. I guess I'm still having trouble understanding why one would need 'ps' to fix a shared libs issue, nor is that a reason to keep adding stuff to /rescue. IMHO it isn't only shared libs issues that /rescue can help with. Also, why would one be running 'ps -aux', which is the only way I can think of to get more than one screen of output if a system is in trouble? Imagine that you've rm'ed /usr by accident in a remote shell session. 
With enough tools in /rescue (which doesn't take lots of tools), you can stop sensitive daemons, find the backup, restore from it, and get a functional system again without a reboot. No doubt, some tools just make the task easier by providing typical command-line idioms. I don't mean I'm so reckless that I need to restore my /usr often, but the 3-4 megabytes occupied by /rescue are a terribly low price today for being able to shoot different parts of one's foot without necessarily hitting the bone. The price for it in terms of disk space is next to nothing, and there are quite useless space hogs in /rescue already (see below on /rescue/vi.) Considering how few people are skilled in ed(1) these days, we have little choice but to include vi. Of course, there should be /rescue/vi, and I have an idea on how to remove its dependence on /usr in a more or less elegant way. I mentioned the not-so-functional /rescue/vi here just to show that we can tolerate certain space waste in /rescue. I won't speak for everyone, but I really like to use fancy shell commands, particularly during hard times: loops, pipelines, etc. So I don't have to enter many commands for a single task or browse I guess I'm not creative enough in the ways I've screwed up my systems and needed tools from /rescue. 8-) Just try to installworld FreeBSD/amd64 over a running FreeBSD/i386. ;-) I don't see the purpose of chown - if you have to fall back to /rescue you're user 'root' - and you're trying to fix enough so you can use standard /*lib /*bin .. Having /rescue/chown is just a matter of completeness of the ch* subset of /rescue tools, because chown's job can't be done by any other stock tools. If /rescue is complete enough, one can find more applications for it. E.g., the loader, a kernel, and /rescue /rescue wasn't intended to be well orthogonal. /rescue was part of the cornerstone of the deal to switch to shared /[s]bin. But it doesn't confine us to the corner forever. 
Having an emergency toolset independent of the rest of the system is good in any case. I bet people will experiment and have fun with their systems more eagerly if they know they still can recover quickly with ready tools in case of a serious error. -- Yar