Re: Page fault, GEOM problem?? (also: using an ASUS A7N8X-XE/nForce2 Ultra 400?)
On 23 jan 2006, at 20.01, Johan Ström wrote:

On 23 jan 2006, at 14.15, Michael S. Eubanks wrote:

On Mon, 2006-01-23 at 10:24 +0100, Johan Ström wrote:

On 23 jan 2006, at 09.53, Michael S. Eubanks wrote:

On Mon, 2006-01-23 at 06:43 +0100, Johan Ström wrote:

Wish I could be of more help. :) Have you tried to toggle the sysctl dma flags? I've seen similar posts in the past with read timeouts caused by DMA being enabled.

# sysctl -a | grep dma
...
hw.ata.ata_dma: 1 <=== Try turning this one off (set it to 0).
hw.ata.atapi_dma: 1
...

Disabling DMA, wouldn't that give me pretty bad performance?

-Michael

If it was not the problem, you could always change it back. It *should* be possible to simply set the control mode on those two disks (``man rc.early'', ``man atacontrol''). Unfortunately, the problem is noted as errata in several FreeBSD versions, tending to appear on SATA disks. I believe this is also a problem with some Linux setups. If you google ``FreeBSD hw.ata.ata_dma RELEASE'' you will eventually find the following page relating to Asus motherboards:

http://www.ryxi.com/freebsd/63-668-write-dma-other-similar-errors-read.shtml

I picked it out based on the following line in the dmesg output:

Nov 29 20:46:09 elfi kernel: ACPI APIC Table: ASUS A7V333

I'd say it's worth a shot. You might even try turning both the flags off temporarily to see what you get. Your guess is as good as mine. :)

Okay, tried turning it off. The disk I/O speeds went even lower... a whopping 9-10MB/s and lots of load ;) And since the crashes come randomly (I haven't been able to reproduce them on demand) I don't really want to run it like this. ;)

I did another test. I moved the controller card and the disks to my MSI K8N Neo motherboard (with AMD64 3200+ etc), and immediately I got write speeds of ~49MB/s:

$ dd if=/dev/zero of=bigfile.zero bs=1024 count=1000024
1024024576 bytes transferred in 21.974227 secs (46601164 bytes/sec)

Compared to

$ dd if=/dev/zero of=bigfile.zero bs=1024 count=1000024
1024024576 bytes transferred in 78.897708 secs (12979142 bytes/sec)

All tests were done in /dev/mirror/gm0s1f on /usr (ufs, NFS exported, local, soft-updates, acls)

Soo.. I guess this mobo is just plain fucked and needs to be replaced with something newer ;) Bad thing is, this is Socket A, so there aren't many choices left in the mobo market. However, I found an ASUS A7N8X-XE NF ULTRA 400 SOCKET A with the nForce2 Ultra 400 chipset. Does anyone have any knowledge about this chipset? How well does it work with FreeBSD? I'll do some googling, but if someone is using this successfully or unsuccessfully, please let me know :)

Got the board now, everything seems to work great: fine transfer speeds, no crashes so far (1 day..). Let's hope this thread ends here. :)

-- Johan
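As a quick cross-check of the two dd runs quoted above, the rates can be recomputed from the byte total and elapsed times that dd printed. This is only a sketch: the block count of 1000024 is inferred from the byte total (1024 x 1000024 = 1024024576), not read from the archived command line, and the variable names are made up for illustration.

```python
# Figures copied from the two dd runs quoted in the message.
BYTES_WRITTEN = 1024024576
BLOCK_SIZE = 1024
INFERRED_COUNT = 1000024  # assumption: inferred from the byte total

def throughput(total_bytes, seconds):
    """Return the average write rate in bytes per second."""
    return total_bytes / seconds

msi_rate = throughput(BYTES_WRITTEN, 21.974227)   # MSI K8N Neo run
asus_rate = throughput(BYTES_WRITTEN, 78.897708)  # ASUS A7V333 run
ratio = msi_rate / asus_rate

# The inferred block count reproduces the byte total dd reported.
assert BLOCK_SIZE * INFERRED_COUNT == BYTES_WRITTEN

print(f"MSI:  {msi_rate / 1e6:.1f} MB/s")
print(f"ASUS: {asus_rate / 1e6:.1f} MB/s")
print(f"MSI is {ratio:.1f}x faster")
```

By this arithmetic the same card and disks write roughly three and a half times faster on the MSI board, which supports the conclusion that the motherboard, not the controller or the disks, is the bottleneck.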
Re: Page fault, GEOM problem??
On Mon, 2006-01-23 at 06:43 +0100, Johan Ström wrote:

On 23 jan 2006, at 01.17, Michael S. Eubanks wrote:

On Sun, 2006-01-22 at 23:51 +0100, Johan Ström wrote:

...snip...

On 22 jan 2006, at 22.58, Michael S. Eubanks wrote:

As far as I know, this card doesn't have any RAID functionality (I've never read anything about it on the web, on the card's box or anywhere else). I'm running GENERIC, which does include ataraid. What does your dmesg identify your card as?

atapci0: Promise PDC40518 SATA150 controller port 0xb800-0xb87f, 0xb400-0xb4ff mem 0xfb80-0xfb800fff,0xfb00-0xfb01 irq 19 at device 12.0 on pci0

Is it the same PDC chipset?

-- Johan

No, I have a different controller. My mistake. I think what is happening is that the DMA read command is failing, which causes the device to be disconnected, and the kernel can't write to the disk from that point on (this is somewhat obvious given the output below).

Nov 29 20:36:54 elfi kernel: subdisk10: detached
Nov 29 20:36:54 elfi kernel: ad10: detached
Nov 29 20:36:54 elfi kernel: unknown: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=426562704
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad10s1 disconnected.

The message on the last line above is generated in any of the following scenarios (from g_mirror.c):

1. Device wasn't running yet, but disk disappear.
2. Disk was active and disapppear.
3. Disk disappear during synchronization process.

Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134356992, length=16384)]

As far as recovering the disk, I remember seeing something in a previous post about booting to single user mode and using fsck after a core dump. I'm assuming the disks worked initially and that you were able to label them etc? Is there any possibility that the disk state may be altered by a power saving feature or setting in the BIOS, and FreeBSD just doesn't know when it happens until the next time it tries to access the disk?

To recover, I've always just rebooted directly; gmirror rebuilds the mirror and fsck is run. There have never been any problems reading labels etc. The only problems have been these sporadic crashes, and the read/write performance (see earlier in the thread). This is a server, so all BIOS power-saving settings are (or should be) turned off. The BIOS should thus never make the disks go to sleep.

Thanks for trying to help!

Wish I could be of more help. :) Have you tried to toggle the sysctl dma flags? I've seen similar posts in the past with read timeouts caused by DMA being enabled.

# sysctl -a | grep dma
...
hw.ata.ata_dma: 1 <=== Try turning this one off (set it to 0).
hw.ata.atapi_dma: 1
...

-Michael
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]
Re: Page fault, GEOM problem??
On 23 jan 2006, at 09.53, Michael S. Eubanks wrote:

On Mon, 2006-01-23 at 06:43 +0100, Johan Ström wrote:

Wish I could be of more help. :) Have you tried to toggle the sysctl dma flags? I've seen similar posts in the past with read timeouts caused by DMA being enabled.

# sysctl -a | grep dma
...
hw.ata.ata_dma: 1 <=== Try turning this one off (set it to 0).
hw.ata.atapi_dma: 1
...

Disabling DMA, wouldn't that give me pretty bad performance?

-Michael
Re: Page fault, GEOM problem??
On Mon, 2006-01-23 at 10:24 +0100, Johan Ström wrote:

On 23 jan 2006, at 09.53, Michael S. Eubanks wrote:

On Mon, 2006-01-23 at 06:43 +0100, Johan Ström wrote:

Wish I could be of more help. :) Have you tried to toggle the sysctl dma flags? I've seen similar posts in the past with read timeouts caused by DMA being enabled.

# sysctl -a | grep dma
...
hw.ata.ata_dma: 1 <=== Try turning this one off (set it to 0).
hw.ata.atapi_dma: 1
...

Disabling DMA, wouldn't that give me pretty bad performance?

-Michael

If it was not the problem, you could always change it back. It *should* be possible to simply set the control mode on those two disks (``man rc.early'', ``man atacontrol''). Unfortunately, the problem is noted as errata in several FreeBSD versions, tending to appear on SATA disks. I believe this is also a problem with some Linux setups. If you google ``FreeBSD hw.ata.ata_dma RELEASE'' you will eventually find the following page relating to Asus motherboards:

http://www.ryxi.com/freebsd/63-668-write-dma-other-similar-errors-read.shtml

I picked it out based on the following line in the dmesg output:

Nov 29 20:46:09 elfi kernel: ACPI APIC Table: ASUS A7V333

I'd say it's worth a shot. You might even try turning both the flags off temporarily to see what you get. Your guess is as good as mine. :)

-Michael
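The flag check suggested above can be sketched as a small script that reads text in the same "name: value" shape as the quoted sysctl output and prints the lines that would switch a chosen flag off. Two things here are assumptions rather than something stated in the thread: that on FreeBSD releases of that era these flags are boot-time tunables set in /boot/loader.conf (check ata(4) for your release), and the helper names, which are made up for illustration.

```python
# Text in the same "name: value" shape as the sysctl output quoted in the thread.
SYSCTL_OUTPUT = """\
hw.ata.ata_dma: 1
hw.ata.atapi_dma: 1
"""

def parse_sysctl(text):
    """Turn 'name: value' lines into a {name: int} dict, skipping anything else."""
    flags = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            flags[name.strip()] = int(value)
    return flags

def loader_conf_lines(flags, disable):
    """Build loader.conf-style lines setting the chosen flags to 0.

    Assumption: the flags are loader tunables; this is not confirmed in the thread.
    """
    return [f'{name}="0"' for name in flags if name in disable]

flags = parse_sysctl(SYSCTL_OUTPUT)
lines = loader_conf_lines(flags, disable={"hw.ata.ata_dma"})
for line in lines:
    print(line)
```

Turning only hw.ata.ata_dma off first, as suggested, keeps the test reversible: removing the line and rebooting restores the previous behaviour.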
Re: Page fault, GEOM problem??
I'm coming in very late here, and only have some hearsay. But a friend of mine has built a new hobby machine, with twin 160G drives on a 3Ware 8006, working as a stripe. He had a bunch of problems with the stability of the drives until I gave him a couple of tiny (half size) jumpers that he put on the drives. Smooth sailing since then. If needed, I can find out what the jumpers did. But looking through the controller's documentation should give you a clue.

Johan Ström wrote:

On 23 jan 2006, at 09.53, Michael S. Eubanks wrote:

On Mon, 2006-01-23 at 06:43 +0100, Johan Ström wrote:

Wish I could be of more help. :) Have you tried to toggle the sysctl dma flags? I've seen similar posts in the past with read timeouts caused by DMA being enabled.

# sysctl -a | grep dma
...
hw.ata.ata_dma: 1 <=== Try turning this one off (set it to 0).
hw.ata.atapi_dma: 1
...

Disabling DMA, wouldn't that give me pretty bad performance?

-Michael

--
Paul Root
Few people know what to do when hula girls attack. - Sam, age 8
Re: Page fault, GEOM problem??
On 23 jan 2006, at 20.16, Paul T. Root wrote:

My friend's disks are SATA. The jumper was to force the drives to use the SATA 1.x 1.5 Gbit/s standard instead of the faster SATA 2.x standard. Older cards can have trouble recognizing newer disks. His were recognized, but very flaky. They've been solid since.

These disks should be SATA150 as far as I know (Maxtor MaXLine III 300GB). The Promise card is named SATAII 150, so there shouldn't be any mismatch. Both the card and the disks support NCQ. I don't know about FreeBSD, on the other hand; I haven't found a way to enable or disable it.

Johan Ström wrote:

On 23 jan 2006, at 15.29, Paul T. Root wrote:

I'm coming in very late here, and only have some hearsay. But a friend of mine has built a new hobby machine, with twin 160G drives on a 3Ware 8006, working as a stripe. He had a bunch of problems with the stability of the drives until I gave him a couple of tiny (half size) jumpers that he put on the drives. Smooth sailing since then. If needed, I can find out what the jumpers did. But looking through the controller's documentation should give you a clue.

As far as I know, SATA drives don't have jumpers. Mine don't seem to, at least. There are two unused pins but I doubt they are for jumpers.

--
Paul Root
Few people know what to do when hula girls attack. - Sam, age 8
Re: Page fault, GEOM problem??
...snip...

Can there be problems with the mobo/controller card? Or is it more likely to be driver related? Promise lists my motherboard (asus a7v333) in their manual for the controller card (promise sataII 150 TX4).

...snip...

After looking at the dmesg output, I am curious whether you are using the promise sataII 150 TX4 controller for the raid disks? I see you are using 6.0-RELEASE whereas I'm using 5.4-STABLE with that particular controller. My dmesg output for the disk array looks like the following:

ad4: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata2-master SATA150
ad6: 238475MB HDS722525VLSA80/V36OA60A [484521/16/63] at ata3-master SATA150
ad8: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata4-master SATA150
ad10: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata5-master SATA150
ar0: 953900MB ATA RAID0 array [65535/255/63] status: READY subdisks:
  disk0 READY on ad4 at ata2-master
  disk1 READY on ad6 at ata3-master
  disk2 READY on ad8 at ata4-master
  disk3 READY on ad10 at ata5-master

The device I mount as my raid filesystem is ar0s1 and I believe it corresponds to ``device ataraid'' in the kernel. I read the raid mirroring page in the handbook, although I'm thinking your controller should represent each disk as ``ar0'' and handle the mirroring itself (possibly consisting of two sets of two disks). I really don't know though.

It looks like the RAID1 mirroring tutorial is for systems that don't actually have a raid controller. Hence, the RAID0 tutorial is the one that I would be using if I did not use the promise controller. Because I _DO_ use the controller, I am simply able to manipulate the ar0 disk array as a single disk. I imagine your setup will differ, but I hope this helps.

-Michael
Re: Page fault, GEOM problem??
On 22 jan 2006, at 22.58, Michael S. Eubanks wrote:

...snip...

Can there be problems with the mobo/controller card? Or is it more likely to be driver related? Promise lists my motherboard (asus a7v333) in their manual for the controller card (promise sataII 150 TX4).

...snip...

After looking at the dmesg output, I am curious whether you are using the promise sataII 150 TX4 controller for the raid disks? I see you are using 6.0-RELEASE whereas I'm using 5.4-STABLE with that particular controller. My dmesg output for the disk array looks like the following:

Hi! Thanks for the response! Yes, this is a Promise SATAII 150 TX4 controller. But as far as I know it doesn't do RAID??

ad4: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata2-master SATA150
ad6: 238475MB HDS722525VLSA80/V36OA60A [484521/16/63] at ata3-master SATA150
ad8: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata4-master SATA150
ad10: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata5-master SATA150
ar0: 953900MB ATA RAID0 array [65535/255/63] status: READY subdisks:
  disk0 READY on ad4 at ata2-master
  disk1 READY on ad6 at ata3-master
  disk2 READY on ad8 at ata4-master
  disk3 READY on ad10 at ata5-master

The device I mount as my raid filesystem is ar0s1 and I believe it corresponds to ``device ataraid'' in the kernel. I read the raid mirroring page in the handbook, although I'm thinking your controller should represent each disk as ``ar0'' and handle the mirroring itself (possibly consisting of two sets of two disks). I really don't know though.

No /dev/ar*..

It looks like the RAID1 mirroring tutorial is for systems that don't actually have a raid controller. Hence, the RAID0 tutorial is the one that I would be using if I did not use the promise controller. Because I _DO_ use the controller, I am simply able to manipulate the ar0 disk array as a single disk. I imagine your setup will differ, but I hope this helps.

As far as I know, this card doesn't have any RAID functionality (I've never read anything about it on the web, on the card's box or anywhere else). I'm running GENERIC, which does include ataraid. What does your dmesg identify your card as?

atapci0: Promise PDC40518 SATA150 controller port 0xb800-0xb87f, 0xb400-0xb4ff mem 0xfb80-0xfb800fff,0xfb00-0xfb01 irq 19 at device 12.0 on pci0

Is it the same PDC chipset?

-- Johan

-Michael
Re: Page fault, GEOM problem??
I just checked the specs for the SATA II controller on the Promise site. It doesn't look like that particular controller is a RAID controller, so you can discard my last post. I imagine you have the correct devices.

-Michael
Re: Page fault, GEOM problem??
On 29 nov 2005, at 21.10, Johan Ström wrote:

On 19 nov 2005, at 00.30, Michal Mertl wrote:

Parv wrote:

in message [EMAIL PROTECTED], wrote Michal Mertl thusly...

Johan Ström wrote:

On 18 nov 2005, at 18.43, Xin LI wrote:

...

So, it seems it does run savecore after running dumpon and mounting disks etc... Is that wrong?

No, this is normal. When you run savecore you need to have mounted filesystems. In order to mount the filesystems they may have to be checked. The fsck program requires a large amount of memory to check larger filesystems, so swap has to be enabled. Core dumps are written to the dump device (swap) from the end, whereas swap is normally used from the beginning (or the other way around). Therefore there's quite a good chance that, even when swap has to be used for fsck, the core dump is intact and usable.

Is there any formula to calculate the size of swap to account for both fsck and the core dump while assigning swap size (short of having two swap partitions)?

None that I know of. Someone posted some figures about fsck's memory consumption to some FreeBSD mailing list. I really don't remember, but I think it was something like a few MB of memory per quite a lot of GB of file system space. E.g. fsck on normally sized file systems (e.g. at most a couple of hundred GB) doesn't normally consume all of normally sized memory (=256MB) and thus doesn't need to swap.

If the usage of the swap file by fsck corrupts the core dump you may start after next crash in single user mode and run the commands manually (without enabling swap).

Is that after the kernel (re)boots? And would the commands to be executed be savecore followed by swapon?

If the dump got corrupted by fsck, you would have to wait for another crash and dump. Then you would reboot and start in single user mode, repair the file systems without swap enabled (fsck would crash on the large file system(s)) and then run savecore. Swapon is then irrelevant; you probably don't need swap for savecore. After running savecore you can continue to multi-user mode as normal (exit from the single user shell). I didn't try all of that but I believe it should work.

Michal

I just got another coredump, hadn't had one since the first one. From messages:

Nov 29 20:36:54 elfi kernel: subdisk10: detached
Nov 29 20:36:54 elfi kernel: ad10: detached
Nov 29 20:36:54 elfi kernel: unknown: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=426562704
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad10s1 disconnected.
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134356992, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134373376, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134389760, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134438912, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268591104, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268607488, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268623872, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5966307328, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5967650816, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5968355328, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5968584704, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5969715200, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5971795968, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5972697088, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063848960, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063865344, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063881728, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063914496, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064324096, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064340480, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064373248,
Re: Page fault, GEOM problem??
On 29 nov 2005, at 21.10, Johan Ström wrote:

I just got another coredump, hadn't had one since the first one. From messages:

Nov 29 20:36:54 elfi kernel: subdisk10: detached
Nov 29 20:36:54 elfi kernel: ad10: detached
Nov 29 20:36:54 elfi kernel: unknown: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=426562704
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad10s1 disconnected.
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134356992, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134373376, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134389760, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134438912, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268591104, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268607488, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268623872, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5966307328, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5967650816, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5968355328, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5968584704, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5969715200, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5971795968, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5972697088, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063848960, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063865344, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063881728, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063914496, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064324096, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064340480, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064373248, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064471552, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18761523712, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762850816, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762867200, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762883584, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762899968, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762949120, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762965504, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18846032384, length=131072)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18846228992, length=131072)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18846441984, length=131072)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18846638592, length=131072)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=20110369280, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=2011168, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=20111696384, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=21073961472, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=21073977856, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=21844845056, length=16384)]
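A burst of failures like the one in the log above is easier to read when tallied. The following is a rough sketch, not a tool from the thread: it runs over a handful of lines copied from the log, and the regular expression and function name are my own. For reference, error 6 is ENXIO ("Device not configured" on FreeBSD), which fits a provider that has just been detached.

```python
import errno
import re
from collections import Counter

# A few lines copied from the log above; a real run would read the messages file.
SAMPLE = """\
Nov 29 20:36:54 elfi kernel: ad10: detached
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad10s1 disconnected.
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134356992, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134373376, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18846032384, length=131072)]
"""

# Matches the "Request failed" lines exactly as they appear in the log.
FAILED = re.compile(
    r"GEOM_MIRROR: Request failed \(error=(?P<err>\d+)\)\. "
    r"(?P<provider>\w+)\[(?P<op>\w+)\(offset=(?P<offset>\d+), length=(?P<length>\d+)\)\]"
)

def summarize(log_text):
    """Count failed requests and sum their lengths per (provider, op, error)."""
    counts = Counter()
    total_bytes = Counter()
    for m in FAILED.finditer(log_text):
        key = (m["provider"], m["op"], int(m["err"]))
        counts[key] += 1
        total_bytes[key] += int(m["length"])
    return counts, total_bytes

counts, total_bytes = summarize(SAMPLE)
for (provider, op, err), n in counts.items():
    name = errno.errorcode.get(err, "unknown")
    print(f"{provider}: {n} failed {op} requests (error={err}, {name}), "
          f"{total_bytes[(provider, op, err)]} bytes")
```

Every failure in the log carries the same timestamp, provider and error, so the summary collapses to a single line: all pending writes to ad10s1 were rejected at the moment the disk was detached.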
Re: Page fault, GEOM problem??
On 19 nov 2005, at 02.35, Pawel Jakub Dawidek wrote: On Sat, Nov 19, 2005 at 01:55:57AM +0100, Johan Ström wrote: snip + I just noticed another thing... My disk performance... sucks! :P + + Some examples (from an otherwise unloaded system): + + [EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=1024 count=100 + 100+0 records in + 100+0 records out + 102400 bytes transferred in 77.014797 secs (13296146 bytes/sec) You won't get more with such small block size. Try bs=128k. Hi Can't say that a bigger blocksize did much better.. [EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=128k count=1 1+0 records in 1+0 records out 131072 bytes transferred in 98.519181 secs (13304211 bytes/sec) [EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=512k count=1 ^C3587+0 records in 3587+0 records out 1880621056 bytes transferred in 145.049578 secs (12965367 bytes/sec) [EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=50k count=1 1+0 records in 1+0 records out 51200 bytes transferred in 38.536217 secs (13286203 bytes/sec) All this time, iostats MB/s column wouldnt go over 0.24MB/s... Back on GENERIC: [EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=128k count=1 1+0 records in 1+0 records out 131072 bytes transferred in 99.497358 secs (13173415 bytes/sec) [EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=512k count=1000 1000+0 records in 1000+0 records out 524288000 bytes transferred in 39.019239 secs (13436654 bytes/sec) Still slow.. However, iostat goes up as high as 5.64MB/s on each disk in the mirror. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am! ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to [EMAIL PROTECTED]
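To compare the runs above on one scale, the summary line that dd prints can be converted to MB/s. A rough sketch, assuming the `N bytes transferred in S secs` format shown in this thread; `dd_rate` is a hypothetical helper, and the two sample lines are copied from the bs=512k runs above.

```python
import re

DD_SUMMARY = re.compile(r"(\d+) bytes transferred in ([\d.]+) secs")

def dd_rate(line):
    """Return throughput in bytes/sec recomputed from a dd summary line."""
    m = DD_SUMMARY.search(line)
    if m is None:
        raise ValueError("not a dd summary line: %r" % line)
    nbytes, secs = int(m.group(1)), float(m.group(2))
    return nbytes / secs

runs = {
    "first run, bs=512k (interrupted)":
        "1880621056 bytes transferred in 145.049578 secs (12965367 bytes/sec)",
    "GENERIC, bs=512k count=1000":
        "524288000 bytes transferred in 39.019239 secs (13436654 bytes/sec)",
}

for label, line in runs.items():
    rate = dd_rate(line)
    print("%-36s %.1f MB/s" % (label, rate / 1e6))
```

Both runs land at roughly 13 MB/s, which suggests the block size is not what limits throughput here.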
Re: Page fault, GEOM problem??
On 11/18/05, Johan Ström [EMAIL PROTECTED] wrote: Ok, just got this not so very nice error on a RELENG_6_0 box (built from sources this morning, GENERIC kernel minus drivers I don't use): The network card is the exact same model as the one I used in the test machine, didn't have any problems there.. [...] So, any ideas what this can be? If there were a disk crash, which I have a hard time believing since I ran powermax (maxtor test program) on both of these disks 3 weeks ago and they have been running fine w/o a single problem since I started using them, why didn't GEOM just kick in and run on the other disk? Pagefaulting is not a way to react if a disk goes dead.. Hope someone can help me/this problem doesn't occur any more... but I suppose that is too much to hope for... Would you please consider trying to obtain a crashdump and send the backtrace so we can investigate more? (Hints can be found at http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html#KERNELDEBUG-OBTAIN) Thanks, -- Xin LI [EMAIL PROTECTED] http://www.delphij.net
Re: Page fault, GEOM problem??
On 18 nov 2005, at 10.17, Xin LI wrote: On 11/18/05, Johan Ström [EMAIL PROTECTED] wrote: Ok, just got this not so very nice error on a RELENG_6_0 box (built from sources this morning, GENERIC kernel minus drivers I dont use): The network card is the exact same model as the one I used in the test machine, didn't have any problems there.. [...] So, any ideas what this can be? If there were a disk crash, wish I have a hard time believing since I ran powermax (maxtor test program) on both of these disk 3 weeks ago and they have been running fine w/o a single problem since I started using them, why didn't just GEOM kick in and run on the other disk? Pagefaulting is not a way to react if a disk goes dead.. Hope someone can help me/this problem doesn't occur any more... but I suppose that is to much to hope for... Would you please consider trying to obtain a crashdump and send the backtrace so we can investigate more? (Hints can be found at http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers- handbook/kerneldebug.html#KERNELDEBUG-OBTAIN) Thanks for answer Doesnt look like I got any usable dump devices.. When booting i get GEOM_MIRROR: Device gm0s1 created (id=4118114647). GEOM_MIRROR: Device gm0s1: provider ad6s1 detected. GEOM_MIRROR: Device gm0s1: provider ad10s1 detected. GEOM_MIRROR: Device gm0s1: provider ad6s1 activated. GEOM_MIRROR: Device gm0s1: provider mirror/gm0s1 launched. GEOM_MIRROR: Device gm0s1: rebuilding provider ad10s1. Trying to mount root from ufs:/dev/mirror/gm0s1a WARNING: / was not properly dismounted Loading configuration files. No suitable dump device was found. Entropy harvesting: interrupts ethernet point_to_point kickstart . swapon: adding /dev/mirror/gm0s1b as swap device Then naturally: /etc/rc: WARNING: Dump device does not exist. Savecore not run. Looked around in the rc-scripts and tried to figure out what it did, the dumpon script tries to autolookup a good dump device but finds none.. 
According to the page you linked to, the dumpon command has to be executed AFTER swapon.. Why are the rc scripts trying to run it before swapon then? Anyway, tried to do dumpon manually on my swap drive: $ dumpon -v /dev/mirror/gm0s1b dumpon: ioctl(DIOCSKERNELDUMP): Operation not supported Didn't work too well.. Also tried savecore manually: $ savecore /var/crash/ /dev/mirror/gm0s1b savecore: no dumps found Didn't work very well either (but probably expected since there were no working dumps..) Google showed me another thread on this list about gmirror swap dump, just a question (whether it was supported) w/o any answers though. Same error as I got. Hope this helps. Thanks again Johan Thanks, -- Xin LI [EMAIL PROTECTED] http://www.delphij.net
Re: Page fault, GEOM problem??
Hi! On 18 nov 2005, at 18.43, Xin LI wrote: Hi, Johan, On 11/18/05, Johan Ström [EMAIL PROTECTED] wrote: On 18 nov 2005, at 10.17, Xin LI wrote: [snip] Doesnt look like I got any usable dump devices.. When booting i get [...] Loading configuration files. No suitable dump device was found. Entropy harvesting: interrupts ethernet point_to_point kickstart . swapon: adding /dev/mirror/gm0s1b as swap device I see, so your both SATA disks are in the same mirror group... Then naturally: /etc/rc: WARNING: Dump device does not exist. Savecore not run. Looked around in the rc-scripts and tried to figure out what it did, the dumpon script tries to autolookup a good dump device but finds none.. Unfortunately, kernel dumps currently does not support every device, for some technical reasons (probably to simplify the crash code so they do not make more mistakes^Wdamages) According to the page you linked to, the dumpon command has to be executed AFTER swapon.. Why is the rc scripts trying to run it before swapon then? I guess this is because that dumpon now can detect dump device automatically, but I'm not quite sure about this. Will look for the reason. I think either Handbook should be updated, or the code should be corrected. What I am very curious is that why dumpon is BEFORE savecore. Maybe I have some misunderstanding... Sorry, partly my misstake.. I think i missunderstod how save savecore works below (when i tried it manually in last mail).. But the messages from above are directly from boot, seems it tries dumpon before savecore? Relevant bootlog from last boot: ad0: 2441MB WDC AC22500L 32.41N35 at ata0-master UDMA33 acd0: CDROM CD-ROM CDU701-F/1.0q at ata1-master PIO4 ad6: 286188MB Maxtor 7L300S0 BANC1G10 at ata3-master SATA150 ad10: 286188MB Maxtor 7L300S0 BANC1G10 at ata5-master SATA150 GEOM_MIRROR: Device gm0s1 created (id=4118114647). GEOM_MIRROR: Device gm0s1: provider ad6s1 detected. GEOM_MIRROR: Device gm0s1: provider ad10s1 detected. 
GEOM_MIRROR: Device gm0s1: provider ad10s1 activated. GEOM_MIRROR: Device gm0s1: provider ad6s1 activated. GEOM_MIRROR: Device gm0s1: provider mirror/gm0s1 launched. Trying to mount root from ufs:/dev/mirror/gm0s1a Loading configuration files. dumpon: (this DIOCSKERNELDUMP message is probably since i specified dumpdev in rc.conf so it forced useage of gm0s1b instead of letting the scripts autodetect.. ) ioctl(DIOCSKERNELDUMP) : Operation not supported Entropy harvesting: interrupts ethernet point_to_point kickstart . swapon: adding /dev/mirror/gm0s1b as swap device Starting file system checks: /dev/mirror/gm0s1a: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/mirror/gm0s1a: clean, 213811 free (771 frags, 26630 blocks, 0.3% fragmentation) /dev/mirror/gm0s1e: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/mirror/gm0s1e: clean, 1012917 free (85 frags, 126604 blocks, 0.0% fragmentation) /dev/mirror/gm0s1f: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/mirror/gm0s1f: clean, 115955787 free (40747 frags, 14489380 blocks, 0.0% fragmentation) /dev/mirror/gm0s1d: FILE SYSTEM CLEAN; SKIPPING CHECKS /dev/mirror/gm0s1d: clean, 1983354 free (4834 frags, 247315 blocks, 0.2% fragmentation) ifconfig stuff Starting devd. Mounting NFS file systems: . Creating and/or trimming log files: . Starting syslogd. Checking for core dump on /dev/mirror/gm0s1b... savecore: no dumps found Starting named. rest of boot So, it seems it does run savecore after running dumpon and mounting disks etc... Is that wrong? Anyway, tried to do dumpon manually on my swap drive: $ dumpon -v /dev/mirror/gm0s1b dumpon: ioctl(DIOCSKERNELDUMP): Operation not supported Didn't work too good.. Also tried savecore manually: $ savecore /var/crash/ /dev/mirror/gm0s1b savecore: no dumps found (This was my misstake, of course there are no dumps when I didnt have a dump when it crashed..) Didnt work very good either (but probably expected since there was no working dumps..) 
Google showed me some other thread in this list about gmirror swap dump, just a question (if it was supported) w/o any answers tho. Same error as I got. It seems that this could not be workaround'ed easily. If possible, my suggestion is that you attach a third disk and create a swap partition on it for the crash dump. If this is not feasible, then adding DDB and KDB may give us a chance to catch the panic and you can use trace command at the ddb prompt to obtain a simplified backtrace, and there is good chance that it would reveal what is happening. I have cc'ed to Pawel who is very knowledgeable in this area, and let's see whether he has some better suggestions :-) Okay, just added an old but working 2 gig disk to the system, made it a swap and swapon'ed and: [EMAIL PROTECTED]:~$ dumpon -v /dev/ad0s1b kernel dumps on /dev/ad0s1b Great! :) So, let's see when/if it dies next time... Before I took it down for the dump-disk, it had been running fine for 1d 1h (since boot after crasch), however probably
Re: Page fault, GEOM problem??
Johan Ström wrote: Hi! On 18 nov 2005, at 18.43, Xin LI wrote: Hi, Johan, large snip So, it seems it does run savecore after running dumpon and mounting disks etc... Is that wrong? No, this is normal. When you run savecore you need to have mounted filesystems. In order to mount the filesystems they may have to be checked. The fsck program requires a large amount of memory to check larger filesystems, so the swap has to be enabled. Core dumps are written to the dump device (swap) from the end whereas the swap is normally used from the beginning (or the other way around). Therefore there's quite a big chance that, even when the swap has to be used for fsck, the core dump is intact and usable. If the usage of the swap file by fsck corrupts the core dump, you may start in single user mode after the next crash and run the commands manually (without enabling swap). As to why you can write kernel core dumps only to certain devices: at the time the kernel is dumping core, it is usually in a pretty bad state, kernel internals may be corrupted and so on. The dumping code is therefore written to be quite low level so that even a wedged kernel can be dumped. The dumping code is part of the hard disk controller drivers. The gmirror is quite a high-level device and geom itself needs a working scheduler, so there will probably never be a way to dump on gmirror-provided swap. When you issue the dumpon command, a check is performed whether the driver for the disk you want to dump on supports kernel core dumps. Michal
Re: Page fault, GEOM problem??
in message [EMAIL PROTECTED], wrote Michal Mertl thusly... Johan Ström wrote: On 18 nov 2005, at 18.43, Xin LI wrote: ... So, it seems it does run savecore after running dumpon and mounting disks etc... Is that wrong? No, this is normal. When you run savecore you need to have mounted filesystems. In order to mount the filesystems they may have to be checked. The fsck program requires big amount of memory to check larger filesystems so the swap has to be enabled. Core dumps are written to the dump device (swap) from the end whereas the swap is normally used from the beginning (or the other way around). Therefore there's quite a big chance that, even when the swap has to be used for fsck, the core dump is intact and usable. Is there any formula to calculate the size of swap to account for fsck core dump while assigning swap size (short of having two swap partitions)? If the usage of the swap file by fsck corrupts the core dump you may start after next crash in single user mode and run the commands manually (without enabling swap). Is that after kernel (re)boots? And would the commands to be executed be savecore followed by swapon? - Parv
Re: Page fault, GEOM problem??
Parv wrote: in message [EMAIL PROTECTED], wrote Michal Mertl thusly... Johan Ström wrote: On 18 nov 2005, at 18.43, Xin LI wrote: ... So, it seems it does run savecore after running dumpon and mounting disks etc... Is that wrong? No, this is normal. When you run savecore you need to have mounted filesystems. In order to mount the filesystems they may have to be checked. The fsck program requires big amount of memory to check larger filesystems so the swap has to be enabled. Core dumps are written to the dump device (swap) from the end whereas the swap is normally used from the beginning (or the other way around). Therefore there's quite a big chance that, even when the swap has to be used for fsck, the core dump is intact and usable. Is there any formula to calculate the size of swap to account for fsck core dump while assigning swap size (short of having two swap partitions)? None that I know of. Someone posted to some FreeBSD mailing list some figures about the fsck consumption of memory. I really don't remember, but I think it was something like some MBs of memory per quite a lot of GB of file system space. E.g. that the fsck on normally sized file systems (e.g. at most a couple of hundred GB) doesn't normally cosume all of normally sized memory (=256MB) and thus doesn't need to swap. If the usage of the swap file by fsck corrupts the core dump you may start after next crash in single user mode and run the commands manually (without enabling swap). Is that after kernel (re)boots? And would the commands to be executed be savecore followed by swapon? If the dump got corrupted by fsck, you would have to wait for another crash and dump. Then you would reboot and start in single user mode, repair the file systems without swap enabled (fsck would crash on the large file system(s)) and then run savecore. Swapon is then irrelevant, you probably don't need swap for savecore. After running savecore you can start normally multi user (exit from the single user shell). 
I didn't try all of that but I believe it should work. Michal
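Michal's description suggests a back-of-the-envelope check for Parv's question. This is only a sketch under the assumptions stated in the thread: the dump is written at the end of the dump device, swap is consumed from the beginning, and nothing else touches the partition. The function name `dump_survives` and all sizes in the example are hypothetical.

```python
def dump_survives(swap_size_mb, dump_size_mb, swap_used_mb):
    """True if swap used from the start does not reach a dump stored at the end."""
    if dump_size_mb > swap_size_mb:
        return False  # the dump would not have fit in the first place
    dump_start = swap_size_mb - dump_size_mb
    return swap_used_mb <= dump_start

# Hypothetical sizes: 1024 MB swap partition holding a 512 MB dump.
print(dump_survives(1024, 512, 100))  # fsck pages out 100 MB
print(dump_survives(1024, 512, 600))  # fsck pages out 600 MB
```

Under these assumptions, a swap partition at least as large as the dump plus the worst-case swap usage of fsck would leave the dump intact; the thread gives no measured figure for the latter.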
Re: Page fault, GEOM problem??
On 18 nov 2005, at 23.39, Michal Mertl wrote: Johan Ström wrote: Hi! On 18 nov 2005, at 18.43, Xin LI wrote: Hi, Johan, large snip So, it seems it does run savecore after running dumpon and mounting disks etc... Is that wrong? No, this is normal. When you run savecore you need to have mounted filesystems. In order to mount the filesystems they may have to be checked. The fsck program requires big amount of memory to check larger filesystems so the swap has to be enabled. Core dumps are written to the dump device (swap) from the end whereas the swap is normally used from the beginning (or the other way around). Therefore there's quite a big chance that, even when the swap has to be used for fsck, the core dump is intact and usable. If the usage of the swap file by fsck corrupts the core dump you may start after next crash in single user mode and run the commands manually (without enabling swap). As to why you can write kernel core dumps only to certain devices the answer is that at the time, when the kernel is dumping core, it is usually in pretty bad state, kernel internals may be corrupted and so on. The dumping code is therefore written to be quite low level so that even wedged kernel can be dumped. The dumping code is part of hard disk controller's drivers. The gmirror is quite high-level device and geom itself needs working scheduler so there will probably never be a way to dump on gmirror provided swap. When you issue the dumpon command the check is performed whether the driver for the disk you want to dump on supports kernel core dumps. Michal Well that makes sense... Then that is right at least.. :) I just noticed another thing... My disk performance... sucks! :P Some examples (from an otherwise unloaded system): [EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=1024 count=100 100+0 records in 100+0 records out 102400 bytes transferred in 77.014797 secs (13296146 bytes/sec) real1m17.100s user0m0.244s sys 0m10.140s 13MB/s from /dev/zero?? 
This was to my home dir (gm0s1f, last label on the slice/disk).. When I'm about to open a new window in screen (ctrl-a-c) it takes forever (or rather, bash takes forever) to init when the above dd is running... Well, iostat during dd:
[EMAIL PROTECTED]:~$ iostat
      tty             ad0              ad6             ad10             cpu
 tin tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0  164  2.19   0  0.00  50.52   3  0.17  50.99   3  0.17   1  0  1  1 97
0.17MB/s?? Am I misreading these iostats or something?.. Load averages directly after the dd is complete are at 0.36, 0.15, 0.05, so the dd doesn't put on enough load to make bash work that slowly... Gotta be something else... Running diskinfo -t gives me good values (for /dev/ad6 and /dev/ad10):
Transfer rates:
        outside:       102400 kbytes in   1.846578 sec =    55454 kbytes/sec
        middle:        102400 kbytes in   1.879855 sec =    54472 kbytes/sec
        inside:        102400 kbytes in   3.147158 sec =    32537 kbytes/sec
So it shouldn't be the disk itself.. those values are the same as when I had the disk in the temp system.. However I never did try any dd speedtests there. Btw, tried to do a regular cp on a dirtree of some gigs, same slooow speed.. Maybe my custom kernel is fucked up or something? It's just a GENERIC with some unused device drivers removed so it would be strange... I'll recompile during the night and test GENERIC tomorrow, reporting back.. Did try to move the cards (network/vga/sata) around in the PCI slots, in case there were any strange conflicts... No difference except I only got one txerror from xl since last boot (wooh!) No crash so far. -- Johan
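The gap between the raw-device figures and the dd result is easier to see side by side. A small sketch that recomputes the diskinfo rates quoted above and compares them with the 13296146 bytes/sec that dd reported; the variable names are illustrative, and treating one of diskinfo's kbytes as 1024 bytes is an assumption.

```python
# diskinfo -t figures from the message above: (kbytes, seconds)
diskinfo = {
    "outside": (102400, 1.846578),
    "middle": (102400, 1.879855),
    "inside": (102400, 3.147158),
}
dd_bytes_per_sec = 13296146  # dd writing through the mirror and UFS

def kbytes_per_sec(kbytes, secs):
    """Transfer rate in kbytes/sec for one diskinfo zone."""
    return kbytes / secs

dd_kbytes_per_sec = dd_bytes_per_sec / 1024  # assumes 1 kbyte = 1024 bytes

for zone, (kb, secs) in diskinfo.items():
    raw = kbytes_per_sec(kb, secs)
    print("%-8s %6.0f kbytes/sec raw, %.1fx the dd rate"
          % (zone, raw, raw / dd_kbytes_per_sec))
```

This is only a rough comparison: diskinfo -t reads from the raw device, while the dd run wrote through gmirror and the filesystem, so some difference is expected even on healthy hardware.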
Re: Page fault, GEOM problem??
On Sat, Nov 19, 2005 at 01:55:57AM +0100, Johan Ström wrote: + + On 18 nov 2005, at 23.39, Michal Mertl wrote: + + Johan Ström wrote: + Hi! + + On 18 nov 2005, at 18.43, Xin LI wrote: + + Hi, Johan, + + large snip + + So, it seems it does run savecore after running dumpon and mounting + disks etc... Is that wrong? + + No, this is normal. When you run savecore you need to have mounted + filesystems. In order to mount the filesystems they may have to be + checked. The fsck program requires big amount of memory to check larger + filesystems so the swap has to be enabled. Core dumps are written to the + dump device (swap) from the end whereas the swap is normally used from + the beginning (or the other way around). Therefore there's quite a big + chance that, even when the swap has to be used for fsck, the core dump + is intact and usable. If the usage of the swap file by fsck corrupts the + core dump you may start after next crash in single user mode and run the + commands manually (without enabling swap). + + As to why you can write kernel core dumps only to certain devices the + answer is that at the time, when the kernel is dumping core, it is + usually in pretty bad state, kernel internals may be corrupted and so + on. The dumping code is therefore written to be quite low level so that + even wedged kernel can be dumped. The dumping code is part of hard disk + controller's drivers. The gmirror is quite high-level device and geom + itself needs working scheduler so there will probably never be a way to + dump on gmirror provided swap. When you issue the dumpon command the + check is performed whether the driver for the disk you want to dump on + supports kernel core dumps. + + Michal + + Well that makes sense... Then that is right at least.. :) + + I just noticed another thing... My disk performance... sucks! 
:P + + Some examples (from an otherwise unloaded system): + + [EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=1024 count=100 + 100+0 records in + 100+0 records out + 102400 bytes transferred in 77.014797 secs (13296146 bytes/sec) You won't get more with such small block size. Try bs=128k. -- Pawel Jakub Dawidek http://www.wheel.pl [EMAIL PROTECTED] http://www.FreeBSD.org FreeBSD committer Am I Evil? Yes, I Am!
Page fault, GEOM problem??
Ok, just got this not so very nice error on a RELENG_6_0 box (built from sources this morning, GENERIC kernel minus drivers I dont use): Nov 17 15:35:43 elfi kernel: subdisk10: detached Nov 17 15:35:43 elfi kernel: ad10: detached Nov 17 15:35:43 elfi kernel: unknown: TIMEOUT - READ_DMA retrying (1 retry left) LBA=85720528 Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad10s1 disconnected. Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134356992, length=16384)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134373376, length=16384)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134438912, length=16384)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268591104, length=16384)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268607488, length=16384)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268623872, length=16384)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268640256, length=16384)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=20151026176, length=2048)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=32299655680, length=8192)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[READ(offset=37363671552, length=16384)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[READ(offset=38349087232, length=16384)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[READ(offset=45453566464, length=16384)] Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). 
ad10s1[READ(offset=54459458048, length=131072)] Nov 17 17:59:18 elfi syslogd: kernel boot file is /boot/kernel/kernel Nov 17 17:59:18 elfi kernel: Nov 17 17:59:18 elfi kernel: Nov 17 17:59:18 elfi kernel: Fatal trap 12: page fault while in kernel mode Nov 17 17:59:18 elfi kernel: fault virtual address = 0x48 Nov 17 17:59:18 elfi kernel: fault code = supervisor read, page not present Nov 17 17:59:18 elfi kernel: instruction pointer= 0x20:0xc0506b92 Nov 17 17:59:18 elfi kernel: stack pointer = 0x28:0xd56d7c9c Nov 17 17:59:18 elfi kernel: frame pointer = 0x28:0xd56d7c9c Nov 17 17:59:18 elfi kernel: code segment = base 0x0, limit 0xf, type 0x1b Nov 17 17:59:18 elfi kernel: = DPL 0, pres 1, def32 1, gran 1 Nov 17 17:59:18 elfi kernel: processor eflags = interrupt enabled, resume, IOPL = 0 Nov 17 17:59:18 elfi kernel: current process= 36 (swi4: clock sio) Nov 17 17:59:18 elfi kernel: trap number= 12 Nov 17 17:59:18 elfi kernel: panic: page fault Nov 17 17:59:18 elfi kernel: Uptime: 8h55m1s ad10 and ad6, 2 brand new Maxtor Maxline 300GB SATA, attached to a Promise PDC40518 SATA150 controller, makes a GEOM mirror gm0s1. I've been running this stuff in another test machine (MSI K8N neo Platinum, KT333 chip I believe), and I havent had a single problem. I moved the disks/controllercard to my real server 24 hours ago, with the only apparent problem I seemd to have was this: Nov 17 07:06:12 elfi kernel: xl0: transmission error: 90 Nov 17 07:06:12 elfi kernel: xl0: tx underrun, increasing tx start threshold to 120 bytes Nov 17 07:06:18 elfi kernel: xl0: watchdog timeout Nov 17 07:06:18 elfi kernel: xl0: link state changed to DOWN Nov 17 07:06:18 elfi kernel: vlan5: link state changed to DOWN Nov 17 07:06:20 elfi kernel: xl0: link state changed to UP Nov 17 07:06:20 elfi kernel: vlan5: link state changed to UP Comming and going... 
these problems just apperade during first 20-30 minutes after boot, then they dissapeared totally (and yes there was plenty of IO on the net going on both during and after these messages). Sometimes i just got the first two messages and nothing happened, but sometimes the watchdog message came and the network died for a minute or so. Here is dmesg from last boot (directly after crash): Copyright (c) 1992-2005 The FreeBSD Project. Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 The Regents of the University of California. All rights reserved. FreeBSD 6.0-RELEASE #0: Thu Nov 17 00:49:29 CET 2005 [EMAIL PROTECTED]:/usr/obj/usr/src/sys/ELFI ACPI APIC Table: ASUS A7V333 Timecounter i8254 frequency 1193182 Hz quality 0 CPU: AMD Athlon(TM) XP 1900+ (1599.56-MHz 686-class CPU) Origin = AuthenticAMD Id = 0x662 Stepping = 2 Features=0x383fbffFPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE, MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE AMD Features=0xc0480800SYSCALL,MP,MMX+,3DNow+,3DNow real memory = 536854528 (511 MB) avail memory = 516014080 (492 MB) ioapic0: