Re: Page fault, GEOM problem?? (also: using a ASUS A7N8X-XE/nForce2 utlra400?)

2006-01-28 Thread Johan Ström


On 23 jan 2006, at 20.01, Johan Ström wrote:



However, I found an ASUS A7N8X-XE NF ULTRA 400 SOCKET A with the nForce2
Ultra 400 chipset. Does anyone have any knowledge about this chipset?
How well does it work with FreeBSD? I'll do some googling, but if
someone is using this successfully or unsuccessfully, please let me
know :)


Got the board now. Everything seems to work great: fine
transfer speeds and no crashes so far (1 day). Let's hope this thread
ends here :)



--
Johan




Re: Page fault, GEOM problem??

2006-01-23 Thread Michael S. Eubanks
On Mon, 2006-01-23 at 06:43 +0100, Johan Ström wrote:
 For recovering, I've always done a direct reboot; gmirror then
 rebuilds the mirror and fsck is run.
 No problems reading labels etc., and there never have been; the only
 problems have been these sporadic crashes, and the read/write
 performance (see earlier in the thread).
 
 This is a server, so all BIOS power-saving settings are (or should be)
 switched off. The BIOS should thus never make the disks go to sleep.
 
 Thanks for trying to help!

Wish I could be of more help. :)  Have you tried toggling the sysctl
DMA flags?  I've seen similar posts in the past with read timeouts
caused by DMA being enabled.

# sysctl -a | grep dma
...
hw.ata.ata_dma: 1  <=== Try turning this one off (change 1 to 0).
hw.ata.atapi_dma: 1
...
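
[Editor's sketch, not part of the original mail.] If you want to check the flags from a script rather than by eye, something like the following could work. It only parses text in the `name: value` form shown above; the function name `parse_sysctl` is made up for illustration, and the sample input is just the two lines quoted in this message. Whether the values can be changed at runtime or only from the loader depends on the FreeBSD version, so check ata(4) before relying on it.

```python
def parse_sysctl(text):
    """Parse 'name: value' lines from sysctl output into a dict (hw.ata.* only)."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().startswith("hw.ata."):
            values[name.strip()] = value.strip()
    return values


# Sample input: the two lines quoted in the mail, nothing more.
sample = """hw.ata.ata_dma: 1
hw.ata.atapi_dma: 1"""

flags = parse_sysctl(sample)
# Names of the DMA flags that are currently switched on.
enabled = sorted(name for name, value in flags.items() if value == "1")
print(enabled)
```
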

-Michael

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Page fault, GEOM problem??

2006-01-23 Thread Johan Ström


On 23 jan 2006, at 09.53, Michael S. Eubanks wrote:


On Mon, 2006-01-23 at 06:43 +0100, Johan Ström wrote:

Wish I could be of more help. :)  Have you tried to toggle the sysctl
dma flags?  I've seen similar posts in the past with read timeouts
caused from dma being enabled.

# sysctl -a | grep dma
...
hw.ata.ata_dma: 1  <=== Try turning this one off (change 1 to 0).
hw.ata.atapi_dma: 1
...


Disabling DMA, wouldn't that give me pretty bad performance?


-Michael





Re: Page fault, GEOM problem??

2006-01-23 Thread Michael S. Eubanks
On Mon, 2006-01-23 at 10:24 +0100, Johan Ström wrote:
 On 23 jan 2006, at 09.53, Michael S. Eubanks wrote:
 
  On Mon, 2006-01-23 at 06:43 +0100, Johan Ström wrote:
 
  Wish I could be of more help. :)  Have you tried to toggle the sysctl
  dma flags?  I've seen similar posts in the past with read timeouts
  caused from dma being enabled.
 
  # sysctl -a | grep dma
  ...
  hw.ata.ata_dma: 1  <=== Try turning this one off (change 1 to 0).
  hw.ata.atapi_dma: 1
  ...
 
 Disabling DMA, wouldn't that give me pretty bad performance?
 
  -Michael
 

If it was not the problem, you could always change it back.  It *should*
be possible to simply set the control mode on those two disks (``man
rc.early'', ``man atacontrol'').  Unfortunately, the problem is noted as
errata in several FreeBSD versions tending to appear on SATA disks.  I
believe this is also a problem with some linux setups.  If you google
``FreeBSD hw.ata.ata_dma RELEASE'' you will eventually find the
following page relating to Asus motherboards:

http://www.ryxi.com/freebsd/63-668-write-dma-other-similar-errors-read.shtml

I picked it out based on the following line in the dmesg output:

 Nov 29 20:46:09 elfi kernel: ACPI APIC Table: ASUS   A7V333  

I'd say it's worth a shot.  You might even try turning both the flags
off temporarily to see what you get.  Your guess is as good as mine.  :)


-Michael


Re: Page fault, GEOM problem??

2006-01-23 Thread Paul T. Root

I'm coming in very late here, and only have some
hearsay. But a friend of mine has built a new hobby
machine, with twin 160G drives on a 3Ware 8006, working as
a stripe. He had a bunch of problems with the stability of the drives
until I gave him a couple of tiny (half-size) jumpers that he
put on the drives. Smooth sailing since then. If needed, I can find out
what the jumpers did. But looking through the controller's doco
should give you a clue.






--
Paul Root
Few people know what to do when hula girls attack. - Sam, age 8



Re: Page fault, GEOM problem?? (also: using a ASUS A7N8X-XE/nForce2 utlra400?)

2006-01-23 Thread Johan Ström


On 23 jan 2006, at 14.15, Michael S. Eubanks wrote:




If it was not the problem, you could always change it back.  It *should*
be possible to simply set the control mode on those two disks (``man
rc.early'', ``man atacontrol'').  Unfortunately, the problem is noted as
errata in several FreeBSD versions tending to appear on SATA disks.  I
believe this is also a problem with some linux setups.  If you google
``FreeBSD hw.ata.ata_dma RELEASE'' you will eventually find the
following page relating to Asus motherboards:

http://www.ryxi.com/freebsd/63-668-write-dma-other-similar-errors-read.shtml

I picked it out based on the following line in the dmesg output:

Nov 29 20:46:09 elfi kernel: ACPI APIC Table: ASUS   A7V333

I'd say it's worth a shot.  You might even try turning both the flags
off temporarily to see what you get.  Your guess is as good as mine.  :)




Okay, tried turning it off. The disk I/O speeds went even lower: a
whopping 9-10 MB/s and lots of load ;)
And since the crashes come randomly (I haven't been able to reproduce
them on demand), I don't really want to run it like this ;)


I did another test. I moved the controller card and the disks to my
MSI K8N Neo motherboard (with AMD64 3200+ etc.), and immediately I got
write speeds of ~49MB/s:


$ dd if=/dev/zero of=bigfile.zero bs=1024 count=1000024
1024024576 bytes transferred in 21.974227 secs (46601164 bytes/sec)

Compared to
$ dd if=/dev/zero of=bigfile.zero bs=1024 count=1000024
1024024576 bytes transferred in 78.897708 secs (12979142 bytes/sec)

All tests were done in
/dev/mirror/gm0s1f on /usr (ufs, NFS exported, local, soft-updates, acls)
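
[Editor's sketch, not part of the original mail.] To compare the two dd runs numerically, the summary lines can be parsed with a few lines of Python. The helper name `throughput_mb_s` is hypothetical; the two input strings are copied from the dd output above, and MB here means 10^6 bytes.

```python
import re

# Matches the summary line dd prints, e.g.
# "1024024576 bytes transferred in 21.974227 secs (46601164 bytes/sec)"
DD_LINE = re.compile(r"(\d+) bytes transferred in ([\d.]+) secs")


def throughput_mb_s(line):
    """Return throughput in MB/s (10**6 bytes) from a dd summary line."""
    m = DD_LINE.search(line)
    if m is None:
        raise ValueError("not a dd summary line")
    nbytes, secs = int(m.group(1)), float(m.group(2))
    return nbytes / secs / 1e6


fast = throughput_mb_s("1024024576 bytes transferred in 21.974227 secs (46601164 bytes/sec)")
slow = throughput_mb_s("1024024576 bytes transferred in 78.897708 secs (12979142 bytes/sec)")
print(f"{fast:.1f} MB/s vs {slow:.1f} MB/s, ratio {fast / slow:.2f}")
```

By dd's own figures, the same controller and disks wrote roughly 3.6 times faster on the MSI board than on the old one.
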


So... I guess this mobo is just plain fucked and needs to be replaced
with something newer ;)
The bad thing is, this is Socket A, so there aren't many choices left
in the mobo market.


However, I found an ASUS A7N8X-XE NF ULTRA 400 SOCKET A with the nForce2
Ultra 400 chipset. Does anyone have any knowledge about this chipset?
How well does it work with FreeBSD? I'll do some googling, but if someone
is using this successfully or unsuccessfully, please let me know :)


--
Johan

Re: Page fault, GEOM problem??

2006-01-23 Thread Johan Ström

On 23 jan 2006, at 20.16, Paul T. Root wrote:

My friend's disks are SATA. The jumper was to force
the drives to use the SATA 1.x 1.5 Gbit/s standard instead
of the faster SATA 2.x standard. Older cards can have
trouble recognizing newer disks.

His were recognized, but very flaky. They've been solid
since.



These disks should be SATA150 afaik (Maxtor MaXLine III 300GB).
The Promise card is named SATAII 150,
so there shouldn't be any mismatch. Both the card and the disks support NCQ.
I don't know about FreeBSD, on the other hand; I haven't found a way to
enable or disable it.
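
[Editor's sketch, not part of the original mail.] The "SATA150" label and the 1.5 Gbit/s figure describe the same link speed: SATA uses 8b/10b encoding, so only 8 of every 10 bits on the wire are payload. The function name below is made up for illustration.

```python
def sata_payload_mb_s(line_rate_gbit):
    """Payload rate in MB/s for a SATA link that uses 8b/10b encoding."""
    payload_bits_per_s = line_rate_gbit * 1e9 * 8 / 10  # 8 payload bits per 10 line bits
    return payload_bits_per_s / 8 / 1e6  # bits -> bytes -> MB


print(sata_payload_mb_s(1.5))  # SATA 1.x link
print(sata_payload_mb_s(3.0))  # SATA 2.x link
```

So a 1.5 Gbit/s link carries at most about 150 MB/s of payload and a 3.0 Gbit/s link about 300 MB/s, which is where the SATA150 and SATA300 names come from.
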




Johan Ström wrote:


On 23 jan 2006, at 15.29, Paul T. Root wrote:


I'm coming in very late here, and only have some
hearsay. But a friend of mine has built a new hobby
machine, with twin 160G drives on a 3Ware 8006, working as
a stripe. He had a bunch of problems with the stability of the drives
until I gave him a couple of tiny (half-size) jumpers that he
put on the drives. Smooth sailing since then. If needed, I can find out
what the jumpers did. But looking through the controller's doco
should give you a clue.

As far as I know, SATA drives don't have jumpers. Mine don't
seem to, at least. There are two unused pins, but I doubt they
are for jumpers.




--
Paul Root
Few people know what to do when hula girls attack. - Sam, age 8







Re: Page fault, GEOM problem??

2006-01-22 Thread Michael S. Eubanks

...snip...

  Can there be problems with the mobo/controller card? Or is it more
  likely to be driver related? Promise lists my motherboard (ASUS
  A7V333) in their manual for the controller card (Promise SATAII 150
  TX4).
 

...snip...

After looking at the dmesg output, I am curious whether you are using
the promise sataII 150 TX4 controller for the raid disks?  I see you are
using 6.0-RELEASE whereas I'm using 5.4-STABLE with that particular
controller.  My dmesg output for the disk array looks like the
following:

ad4: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata2-master SATA150
ad6: 238475MB HDS722525VLSA80/V36OA60A [484521/16/63] at ata3-master SATA150
ad8: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata4-master SATA150
ad10: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata5-master SATA150
ar0: 953900MB ATA RAID0 array [65535/255/63] status: READY subdisks:
 disk0 READY on ad4 at ata2-master
 disk1 READY on ad6 at ata3-master
 disk2 READY on ad8 at ata4-master
 disk3 READY on ad10 at ata5-master

The device I mount as my raid filesystem is ar0s1 and I believe it
corresponds to ``device ataraid'' in the kernel.  I read the raid
mirroring page in the handbook, although, I'm thinking your controller
should represent each disk as ``ar0'' and handle the mirroring itself
(possibly consisting of two sets of two disks).  I really don't know
though.
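
[Editor's sketch, not part of the original mail.] The ar0 size in the dmesg output above is simply the sum of the four member disks, as expected for a RAID0 stripe with no redundancy. A small Python check, assuming the `adN: <size>MB ...` line format shown above (the regex and the function name are illustrative only):

```python
import re

# Matches member-disk lines such as "ad4: 238475MB HDT722525DLA380/... at ata2-master"
DISK_LINE = re.compile(r"^(ad\d+): (\d+)MB ")


def disk_sizes_mb(dmesg_text):
    """Map each adN device named in the dmesg text to its size in MB."""
    sizes = {}
    for line in dmesg_text.splitlines():
        m = DISK_LINE.match(line.strip())
        if m:
            sizes[m.group(1)] = int(m.group(2))
    return sizes


dmesg = """\
ad4: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata2-master SATA150
ad6: 238475MB HDS722525VLSA80/V36OA60A [484521/16/63] at ata3-master SATA150
ad8: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata4-master SATA150
ad10: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata5-master SATA150
ar0: 953900MB ATA RAID0 array [65535/255/63] status: READY subdisks:
"""

sizes = disk_sizes_mb(dmesg)
total = sum(sizes.values())
print(total)  # compare with the 953900MB reported for ar0
```
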

It looks like the RAID1 mirroring tutorial is for systems that don't
actually have a raid controller.  Hence, the RAID0 tutorial is the one
that I would be using if I did not use the promise controller.  Because
I _DO_ use the controller, I am simply able to manipulate the ar0 disk
array as a single disk.  I imagine your setup will differ, but I hope
this helps.

-Michael


Re: Page fault, GEOM problem??

2006-01-22 Thread Johan Ström

On 22 jan 2006, at 22.58, Michael S. Eubanks wrote:




...snip...



Can there be problems with the mobo/controller card? Or is it more
likely to be driver related? Promise lists my motherboard (ASUS
A7V333) in their manual for the controller card (Promise SATAII 150
TX4).




...snip...

After looking at the dmesg output, I am curious whether you are using
the promise sataII 150 TX4 controller for the raid disks?  I see you are
using 6.0-RELEASE whereas I'm using 5.4-STABLE with that particular
controller.  My dmesg output for the disk array looks like the
following:



Hi! Thanks for the response!
Yes, this is a Promise SATAII 150 TX4 controller. But afaik it
doesn't do RAID?





ad4: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata2-master SATA150
ad6: 238475MB HDS722525VLSA80/V36OA60A [484521/16/63] at ata3-master SATA150
ad8: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata4-master SATA150
ad10: 238475MB HDT722525DLA380/V44OA80A [484521/16/63] at ata5-master SATA150
ar0: 953900MB ATA RAID0 array [65535/255/63] status: READY subdisks:
 disk0 READY on ad4 at ata2-master
 disk1 READY on ad6 at ata3-master
 disk2 READY on ad8 at ata4-master
 disk3 READY on ad10 at ata5-master

The device I mount as my raid filesystem is ar0s1 and I believe it
corresponds to ``device ataraid'' in the kernel.  I read the raid
mirroring page in the handbook, although, I'm thinking your controller
should represent each disk as ``ar0'' and handle the mirroring itself
(possibly consisting of two sets of two disks).  I really don't know
though.



No /dev/ar*..




It looks like the RAID1 mirroring tutorial is for systems that don't
actually have a raid controller.  Hence, the RAID0 tutorial is the one
that I would be using if I did not use the promise controller.  Because
I _DO_ use the controller, I am simply able to manipulate the ar0 disk
array as a single disk.  I imagine your setup will differ, but I hope
this helps.



As far as I know, this card doesn't have any RAID functionality (I've never
read anything about it on the web, on the card's box, or anywhere else).

I'm running GENERIC, which does include ataraid.
What does your dmesg identify your card as?

atapci0: Promise PDC40518 SATA150 controller port 0xb800-0xb87f, 
0xb400-0xb4ff mem 0xfb80-0xfb800fff,0xfb00-0xfb01 irq 19  
at device 12.0 on pci0


Is it the same PDC chipset?

--
Johan



-Michael







Re: Page fault, GEOM problem??

2006-01-22 Thread Michael S. Eubanks

I just checked the specs for the sata II controller on the promise site.
It doesn't look like that particular controller is a RAID controller so
you can discard my last post.  I imagine you have the correct devices.


-Michael


Re: Page fault, GEOM problem??

2006-01-22 Thread Johan Ström

On 23 jan 2006, at 01.17, Michael S. Eubanks wrote:



On Sun, 2006-01-22 at 23:51 +0100, Johan Ström wrote:

...snip...



On 22 jan 2006, at 22.58, Michael S. Eubanks wrote:
As far as I know, this card doesn't have any RAID functionality (I've never
read anything about it on the web, on the card's box, or anywhere else).

I'm running GENERIC, which does include ataraid..
What does your dmesg identify your card as?

atapci0: Promise PDC40518 SATA150 controller port 0xb800-0xb87f,
0xb400-0xb4ff mem 0xfb80-0xfb800fff,0xfb00-0xfb01 irq 19
at device 12.0 on pci0

Is it the same PDC chipset?

--
Johan




No, I have a different controller.  My mistake.  I think what is
happening is the DMA read command is failing, therefore causing the
device to be disconnected, and the kernel can't write to the disk from
that point on (this is somewhat obvious given the output below).



Nov 29 20:36:54 elfi kernel: subdisk10: detached
Nov 29 20:36:54 elfi kernel: ad10: detached
Nov 29 20:36:54 elfi kernel: unknown: TIMEOUT - READ_DMA48 retrying
(1 retry left) LBA=426562704
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Device gm0s1: provider
ad10s1 disconnected.



The message seen from the last line above is generated in any of the
following scenarios (from g_mirror.c):
  1. Device wasn't running yet, but disk disappear.
  2. Disk was active and disapppear.
  3. Disk disappear during synchronization process.



Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6).
ad10s1[WRITE(offset=134356992, length=16384)]



As far as recovering the disk, I remember seeing something about booting
to single user mode and using fsck after a core dump in a previous post.
I'm assuming the disks worked initially and that you were able to label
them etc?  Is there any possibility that the disk state may be altered
by a power saving feature or setting in the BIOS and FreeBSD just
doesn't know when it happens until the next time it tries to access the
disk?



For recovering, I've always done a direct reboot; gmirror then
rebuilds the mirror and fsck is run.
No problems reading labels etc., and there never have been; the only
problems have been these sporadic crashes, and the read/write
performance (see earlier in the thread).


This is a server, so all BIOS power-saving settings are (or should be)
switched off. The BIOS should thus never make the disks go to sleep.






-Michael




Thanks for trying to help!
--
Johan

Re: Page fault, GEOM problem??

2006-01-19 Thread Johan Ström

On 29 nov 2005, at 21.10, Johan Ström wrote:

I just got another coredump, hadn't had one since the first one. From messages:


Nov 29 20:36:54 elfi kernel: subdisk10: detached
Nov 29 20:36:54 elfi kernel: ad10: detached
Nov 29 20:36:54 elfi kernel: unknown: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=426562704
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad10s1 disconnected.

Re: Page fault, GEOM problem??

2005-12-01 Thread Johan Ström

On 29 nov 2005, at 21.10, Johan Ström wrote:


I just got another coredump, hadn't had one since the first one. From messages:


Nov 29 20:36:54 elfi kernel: subdisk10: detached
Nov 29 20:36:54 elfi kernel: ad10: detached
Nov 29 20:36:54 elfi kernel: unknown: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=426562704
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad10s1 disconnected.
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134356992, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134373376, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134389760, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134438912, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268591104, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268607488, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268623872, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5966307328, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5967650816, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5968355328, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5968584704, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5969715200, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5971795968, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5972697088, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063848960, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063865344, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063881728, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063914496, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064324096, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064340480, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064373248, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064471552, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18761523712, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762850816, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762867200, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762883584, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762899968, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762949120, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18762965504, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18846032384, length=131072)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18846228992, length=131072)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18846441984, length=131072)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=18846638592, length=131072)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=20110369280, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=2011168, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=20111696384, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=21073961472, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=21073977856, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=21844845056, length=16384)]
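
[Editor's sketch, not part of the original mail.] A log like this is easier to summarise with a short script than by eye. The Python below extracts the failed GEOM_MIRROR requests and looks up the symbolic name of error 6. The regex is written against the message format seen in this log only, the function name `parse_failures` is made up, and the sample contains just two entries copied from above. It tolerates entries that a mail client has wrapped onto a second line.

```python
import errno
import re

# One failed request, e.g.
# "GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134356992, length=16384)]"
FAILED = re.compile(
    r"GEOM_MIRROR: Request failed \(error=(\d+)\)\.\s+"
    r"(\w+)\[(READ|WRITE)\(offset=(\d+), length=(\d+)\)\]"
)


def parse_failures(log_text):
    """Return one dict per failed request found in the log text."""
    joined = re.sub(r"\s*\n\s*", " ", log_text)  # undo mail-client line wrapping
    return [
        {"error": int(err), "provider": prov, "op": op,
         "offset": int(off), "length": int(length)}
        for err, prov, op, off, length in FAILED.findall(joined)
    ]


sample = """\
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6).
ad10s1[WRITE(offset=134356992, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6).
ad10s1[WRITE(offset=18846032384, length=131072)]
"""

failures = parse_failures(sample)
total_bytes = sum(f["length"] for f in failures)
error_name = errno.errorcode[failures[0]["error"]]  # symbolic errno name
print(len(failures), total_bytes, error_name)
```

Error 6 is ENXIO, which fits the picture of ad10 having been detached before these writes were attempted.
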

Re: Page fault, GEOM problem??

2005-11-29 Thread Johan Ström

On 19 nov 2005, at 00.30, Michal Mertl wrote:

Parv wrote:

in message [EMAIL PROTECTED], wrote Michal
Mertl thusly...


Johan Ström wrote:


On 18 nov 2005, at 18.43, Xin LI wrote:

...

So, it seems it does run savecore after running dumpon and
mounting  disks etc... Is that wrong?


No, this is normal. When you run savecore you need to have mounted
filesystems. In order to mount the filesystems they may have to be
checked. The fsck program requires a large amount of memory to check
larger filesystems so the swap has to be enabled. Core dumps are
written to the dump device (swap) from the end whereas the swap is
normally used from the beginning (or the other way around).
Therefore there's quite a big chance that, even when the swap has
to be used for fsck, the core dump is intact and usable.


Is there any formula to calculate the size of swap to account for
fsck  core dump while assigning swap size (short of having two swap
partitions)?


None that I know of. Someone posted to some FreeBSD mailing list some
figures about the fsck consumption of memory. I really don't remember,
but I think it was something like some MBs of memory per quite a lot of
GB of file system space. E.g. that the fsck on normally sized file
systems (e.g. at most a couple of hundred GB) doesn't normally consume
all of normally sized memory (=256MB) and thus doesn't need to swap.



If the usage of the swap file by fsck corrupts the core dump you
may start after next crash in single user mode and run the
commands manually (without enabling swap).


Is that after kernel (re)boots?  And would the commands to be
executed be savecore followed by swapon?


If the dump got corrupted by fsck, you would have to wait for another
crash and dump. Then you would reboot and start in single user mode,
repair the file systems without swap enabled (fsck would crash on the
large file system(s)) and then run savecore. Swapon is then irrelevant,
you probably don't need swap for savecore. After running savecore you
can start normally multi user (exit from the single user shell).

I didn't try all of that but I believe it should work.

Michal



I just got another core dump; I hadn't had one since the first one. From messages:


Nov 29 20:36:54 elfi kernel: subdisk10: detached
Nov 29 20:36:54 elfi kernel: ad10: detached
Nov 29 20:36:54 elfi kernel: unknown: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=426562704
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad10s1 disconnected.
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134356992, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134373376, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134389760, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134438912, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268591104, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268607488, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268623872, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5966307328, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5967650816, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5968355328, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5968584704, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5969715200, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5971795968, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=5972697088, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063848960, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063865344, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063881728, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16063914496, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064324096, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064340480, length=16384)]
Nov 29 20:36:54 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=16064373248, length=16384)]
Nov 29 20:36:54 elfi kernel: 

Re: Page fault, GEOM problem??

2005-11-19 Thread Johan Ström


On 19 nov 2005, at 02.35, Pawel Jakub Dawidek wrote:


On Sat, Nov 19, 2005 at 01:55:57AM +0100, Johan Ström wrote:

snip

+ I just noticed another thing... My disk performance... sucks! :P
+
+ Some examples (from an otherwise unloaded system):
+
+ [EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=1024 count=1000000
+ 1000000+0 records in
+ 1000000+0 records out
+ 1024000000 bytes transferred in 77.014797 secs (13296146 bytes/sec)

You won't get more with such small block size. Try bs=128k.


Hi
Can't say that a bigger blocksize did much better..

[EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes transferred in 98.519181 secs (13304211 bytes/sec)

[EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=512k  
count=1

^C3587+0 records in
3587+0 records out
1880621056 bytes transferred in 145.049578 secs (12965367 bytes/sec)

[EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=50k count=10000
10000+0 records in
10000+0 records out
512000000 bytes transferred in 38.536217 secs (13286203 bytes/sec)

All this time, iostat's MB/s column wouldn't go over 0.24MB/s...

Back on GENERIC:

[EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=128k count=10000
10000+0 records in
10000+0 records out
1310720000 bytes transferred in 99.497358 secs (13173415 bytes/sec)

[EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=512k count=1000
1000+0 records in
1000+0 records out
524288000 bytes transferred in 39.019239 secs (13436654 bytes/sec)

Still slow.. However, iostat goes up as high as 5.64MB/s on each disk  
in the mirror.
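
For comparing the runs, the bytes and seconds that dd reports can be turned into MB/s with a small helper. This is only a sketch: rate_mb is a made-up name for illustration, not a system utility, and the sample figures are the ones from the bs=512k count=1000 run on GENERIC.

```shell
# rate_mb BYTES SECONDS
# Prints throughput in MB/s (1 MB = 10^6 bytes) with two decimals.
# "rate_mb" is a hypothetical helper name, not part of FreeBSD.
rate_mb() {
    awk -v bytes="$1" -v secs="$2" 'BEGIN { printf "%.2f\n", bytes / secs / 1000000 }'
}

# Figures from the bs=512k count=1000 run on GENERIC
rate_mb 524288000 39.019239
```

That works out to roughly 13 MB/s, the same figure dd itself reports as 13436654 bytes/sec, and well below the 32-55 MB/s that diskinfo -t measured on the raw disks.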






--
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Page fault, GEOM problem??

2005-11-18 Thread Xin LI
On 11/18/05, Johan Ström [EMAIL PROTECTED] wrote:
 Ok, just got this not so very nice error on a RELENG_6_0 box (built
 from sources this morning, GENERIC kernel minus drivers I dont use):
 The network card is the exact same model as the one I used in the
 test machine, didn't have any problems there..
[...]
 So, any ideas what this can be? If there were a disk crash, wish I
 have a hard time believing since I ran powermax (maxtor test program)
 on both of these disk 3 weeks ago and they have been running fine w/o
 a single problem since I started using them, why didn't just GEOM
 kick in and run on the other disk? Pagefaulting is not a way to react
 if a disk goes dead..

 Hope someone can help me/this problem doesn't occur any more... but I
 suppose that is to much to hope for...

Would you please consider trying to obtain a crashdump and send the
backtrace so we can investigate more?

(Hints can be found at
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html#KERNELDEBUG-OBTAIN)

Thanks,
--
Xin LI [EMAIL PROTECTED] http://www.delphij.net

Re: Page fault, GEOM problem??

2005-11-18 Thread Johan Ström


On 18 nov 2005, at 10.17, Xin LI wrote:


On 11/18/05, Johan Ström [EMAIL PROTECTED] wrote:

Ok, just got this not so very nice error on a RELENG_6_0 box (built
from sources this morning, GENERIC kernel minus drivers I dont use):
The network card is the exact same model as the one I used in the
test machine, didn't have any problems there..

[...]

So, any ideas what this can be? If there were a disk crash, wish I
have a hard time believing since I ran powermax (maxtor test program)
on both of these disk 3 weeks ago and they have been running fine w/o
a single problem since I started using them, why didn't just GEOM
kick in and run on the other disk? Pagefaulting is not a way to react
if a disk goes dead..

Hope someone can help me/this problem doesn't occur any more... but I
suppose that is to much to hope for...


Would you please consider trying to obtain a crashdump and send the
backtrace so we can investigate more?

(Hints can be found at
http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug.html#KERNELDEBUG-OBTAIN)




Thanks for the answer.

Doesn't look like I have any usable dump devices.
When booting I get:

GEOM_MIRROR: Device gm0s1 created (id=4118114647).
GEOM_MIRROR: Device gm0s1: provider ad6s1 detected.
GEOM_MIRROR: Device gm0s1: provider ad10s1 detected.
GEOM_MIRROR: Device gm0s1: provider ad6s1 activated.
GEOM_MIRROR: Device gm0s1: provider mirror/gm0s1 launched.
GEOM_MIRROR: Device gm0s1: rebuilding provider ad10s1.
Trying to mount root from ufs:/dev/mirror/gm0s1a
WARNING: / was not properly dismounted
Loading configuration files.
No suitable dump device was found.
Entropy harvesting:
interrupts
ethernet
point_to_point
kickstart
.
swapon: adding /dev/mirror/gm0s1b as swap device

Then naturally:
/etc/rc: WARNING: Dump device does not exist.  Savecore not run.

I looked around in the rc scripts and tried to figure out what they do; the dumpon script tries to auto-detect a good dump device but finds none.
According to the page you linked to, the dumpon command has to be executed AFTER swapon. Why are the rc scripts trying to run it before swapon, then?

Anyway, tried to do dumpon manually on my swap drive:

$ dumpon -v /dev/mirror/gm0s1b
dumpon: ioctl(DIOCSKERNELDUMP): Operation not supported

Didn't work too good..
Also tried savecore manually:

$ savecore /var/crash/ /dev/mirror/gm0s1b
savecore: no dumps found

That didn't work very well either (but that is probably expected, since there were no working dumps).
Google showed me another thread on this list about dumping to gmirror swap, just a question (whether it was supported) without any answers, though. Same error as I got.


Hope this helps.
Thanks again

Johan


Thanks,
--
Xin LI [EMAIL PROTECTED] http://www.delphij.net


Re: Page fault, GEOM problem??

2005-11-18 Thread Johan Ström

Hi!

On 18 nov 2005, at 18.43, Xin LI wrote:


Hi, Johan,

On 11/18/05, Johan Ström [EMAIL PROTECTED] wrote:

On 18 nov 2005, at 10.17, Xin LI wrote:

[snip]

Doesnt look like I got any usable dump devices..
When booting i get

[...]

Loading configuration files.
No suitable dump device was found.
Entropy harvesting:
interrupts
ethernet
point_to_point
kickstart
.
swapon: adding /dev/mirror/gm0s1b as swap device


I see, so your both SATA disks are in the same mirror group...


Then naturally:
/etc/rc: WARNING: Dump device does not exist.  Savecore not run.

Looked around in the rc-scripts and tried to figure out what it did,
the dumpon script
tries to autolookup a good dump device but finds none..


Unfortunately, kernel dumps currently does not support every device,
for some technical reasons (probably to simplify the crash code so
they do not make more mistakes^Wdamages)


According to the page you linked to, the dumpon command has to be
executed AFTER swapon.. Why is the rc scripts trying to run it before
swapon then?


I guess this is because that dumpon now can detect dump device
automatically, but I'm not quite sure about this.  Will look for the
reason.  I think either Handbook should be updated, or the code should
be corrected.

What I am very curious is that why dumpon is BEFORE savecore.  Maybe
I have some misunderstanding...


Sorry, partly my mistake. I think I misunderstood how savecore works below (when I tried it manually in my last mail).
But the messages above are directly from boot; it seems it tries dumpon before savecore? Relevant boot log from the last boot:



ad0: 2441MB WDC AC22500L 32.41N35 at ata0-master UDMA33
acd0: CDROM CD-ROM CDU701-F/1.0q at ata1-master PIO4
ad6: 286188MB Maxtor 7L300S0 BANC1G10 at ata3-master SATA150
ad10: 286188MB Maxtor 7L300S0 BANC1G10 at ata5-master SATA150
GEOM_MIRROR: Device gm0s1 created (id=4118114647).
GEOM_MIRROR: Device gm0s1: provider ad6s1 detected.
GEOM_MIRROR: Device gm0s1: provider ad10s1 detected.
GEOM_MIRROR: Device gm0s1: provider ad10s1 activated.
GEOM_MIRROR: Device gm0s1: provider ad6s1 activated.
GEOM_MIRROR: Device gm0s1: provider mirror/gm0s1 launched.
Trying to mount root from ufs:/dev/mirror/gm0s1a
Loading configuration files.
dumpon: ioctl(DIOCSKERNELDUMP): Operation not supported
(this DIOCSKERNELDUMP message is probably because I specified dumpdev in rc.conf, so it forced usage of gm0s1b instead of letting the scripts autodetect)
Entropy harvesting:
interrupts
ethernet
point_to_point
kickstart
.
swapon: adding /dev/mirror/gm0s1b as swap device
Starting file system checks:
/dev/mirror/gm0s1a: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/mirror/gm0s1a: clean, 213811 free (771 frags, 26630 blocks, 0.3% fragmentation)
/dev/mirror/gm0s1e: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/mirror/gm0s1e: clean, 1012917 free (85 frags, 126604 blocks, 0.0% fragmentation)
/dev/mirror/gm0s1f: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/mirror/gm0s1f: clean, 115955787 free (40747 frags, 14489380 blocks, 0.0% fragmentation)
/dev/mirror/gm0s1d: FILE SYSTEM CLEAN; SKIPPING CHECKS
/dev/mirror/gm0s1d: clean, 1983354 free (4834 frags, 247315 blocks, 0.2% fragmentation)

ifconfig stuff
Starting devd.
Mounting NFS file systems:
.
Creating and/or trimming log files:
.
Starting syslogd.
Checking for core dump on /dev/mirror/gm0s1b...
savecore: no dumps found
Starting named.
rest of boot

So, it seems it does run savecore after running dumpon and mounting  
disks etc... Is that wrong?





Anyway, tried to do dumpon manually on my swap drive:

$ dumpon -v /dev/mirror/gm0s1b
dumpon: ioctl(DIOCSKERNELDUMP): Operation not supported

Didn't work too good..
Also tried savecore manually:

$ savecore /var/crash/ /dev/mirror/gm0s1b
savecore: no dumps found


(This was my mistake, of course there are no dumps when I didn't have a dump when it crashed..)




Didnt work very good either (but probably expected since there was no
working dumps..)
Google showed me some other thread in this list about gmirror swap
dump, just a question (if it was supported) w/o any answers tho. Same
error as I got.


It seems that this could not be workaround'ed easily.  If possible, my
suggestion is that you attach a third disk and create a swap partition
on it for the crash dump.  If this is not feasible, then adding DDB
and KDB may give us a chance to catch the panic and you can use
trace command at the ddb prompt to obtain a simplified backtrace,
and there is good chance that it would reveal what is happening.

I have cc'ed to Pawel who is very knowledgeable in this area, and
let's see whether he has some better suggestions :-)


Okay, just added an old but working 2 gig disk to the system, made it a swap and swapon'ed and:


[EMAIL PROTECTED]:~$ dumpon -v /dev/ad0s1b
kernel dumps on /dev/ad0s1b

Great! :) So, let's see when/if it dies next time... Before I took it down for the dump disk, it had been running fine
for 1d 1h (since the boot after the crash), however probably 

Re: Page fault, GEOM problem??

2005-11-18 Thread Michal Mertl
Johan Ström wrote:
 Hi!
 
 On 18 nov 2005, at 18.43, Xin LI wrote:
 
  Hi, Johan,

 large snip

 So, it seems it does run savecore after running dumpon and mounting  
 disks etc... Is that wrong?

No, this is normal. When you run savecore you need to have mounted
filesystems. In order to mount the filesystems they may have to be
checked. The fsck program requires a large amount of memory to check
larger filesystems, so the swap has to be enabled. Core dumps are written
to the dump device (swap) from the end, whereas the swap is normally used
from the beginning (or the other way around). Therefore there's quite a
big chance that, even when the swap has to be used for fsck, the core dump
is intact and usable. If the usage of the swap by fsck corrupts the core
dump, you may start in single user mode after the next crash and run the
commands manually (without enabling swap).

As to why you can write kernel core dumps only to certain devices: at
the time the kernel is dumping core, it is usually in a pretty bad
state, kernel internals may be corrupted and so on. The dumping code is
therefore written to be quite low level so that even a wedged kernel can
be dumped. The dumping code is part of the hard disk controller drivers.
gmirror is quite a high-level device and GEOM itself needs a working
scheduler, so there will probably never be a way to dump on
gmirror-provided swap. When you issue the dumpon command, a check is
performed whether the driver for the disk you want to dump on
supports kernel core dumps.
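
In practice this means pointing the dump device at a plain disk partition rather than at the mirror. A minimal /etc/rc.conf sketch, assuming a spare disk with a swap partition at /dev/ad0s1b (the partition Johan used for dumpon in this thread) and the usual /var/crash directory; adjust both to the actual system:

```shell
# /etc/rc.conf fragment (sketch; the device name is an assumption)
dumpdev="/dev/ad0s1b"   # plain ATA disk partition, not /dev/mirror/gm0s1b
dumpdir="/var/crash"    # directory savecore writes the dump to at boot
```

With dumpdev set to the gmirror provider instead, dumpon fails with "Operation not supported", as shown in the boot log quoted earlier in the thread.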

Michal



Re: Page fault, GEOM problem??

2005-11-18 Thread Parv
in message [EMAIL PROTECTED], wrote Michal
Mertl thusly...

 Johan Ström wrote:
  
  On 18 nov 2005, at 18.43, Xin LI wrote:
...
  So, it seems it does run savecore after running dumpon and
  mounting  disks etc... Is that wrong?
 
 No, this is normal. When you run savecore you need to have mounted
 filesystems. In order to mount the filesystems they may have to be
 checked. The fsck program requires big amount of memory to check
 larger filesystems so the swap has to be enabled. Core dumps are
 written to the dump device (swap) from the end whereas the swap is
 normally used from the beginning (or the other way around).
 Therefore there's quite a big chance that, even when the swap has
 to be used for fsck, the core dump is intact and usable.

Is there any formula to calculate the size of swap to account for
fsck  core dump while assigning swap size (short of having two swap
partitions)?


 If the usage of the swap file by fsck corrupts the core dump you
 may start after next crash in single user mode and run the
 commands manually (without enabling swap).

Is that after kernel (re)boots?  And would the commands to be
executed be savecore followed by swapon?


  - Parv

-- 



Re: Page fault, GEOM problem??

2005-11-18 Thread Michal Mertl
Parv wrote:
 in message [EMAIL PROTECTED], wrote Michal
 Mertl thusly...
 
  Johan Ström wrote:
   
   On 18 nov 2005, at 18.43, Xin LI wrote:
 ...
   So, it seems it does run savecore after running dumpon and
   mounting  disks etc... Is that wrong?
  
  No, this is normal. When you run savecore you need to have mounted
  filesystems. In order to mount the filesystems they may have to be
  checked. The fsck program requires big amount of memory to check
  larger filesystems so the swap has to be enabled. Core dumps are
  written to the dump device (swap) from the end whereas the swap is
  normally used from the beginning (or the other way around).
  Therefore there's quite a big chance that, even when the swap has
  to be used for fsck, the core dump is intact and usable.
 
 Is there any formula to calculate the size of swap to account for
 fsck  core dump while assigning swap size (short of having two swap
 partitions)?

None that I know of. Someone posted to some FreeBSD mailing list some
figures about the fsck consumption of memory. I really don't remember,
but I think it was something like some MBs of memory per quite a lot of
GB of file system space. E.g. that the fsck on normally sized file
systems (e.g. at most a couple of hundred GB) doesn't normally consume
all of normally sized memory (=256MB) and thus doesn't need to swap.

  If the usage of the swap file by fsck corrupts the core dump you
  may start after next crash in single user mode and run the
  commands manually (without enabling swap).
 
 Is that after kernel (re)boots?  And would the commands to be
 executed be savecore followed by swapon?

If the dump got corrupted by fsck, you would have to wait for another
crash and dump. Then you would reboot and start in single user mode,
repair the file systems without swap enabled (fsck would crash on the
large file system(s)) and then run savecore. Swapon is then irrelevant,
you probably don't need swap for savecore. After running savecore you
can start normally multi user (exit from the single user shell).

I didn't try all of that but I believe it should work.
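
The sequence described above could look roughly like this at the single-user prompt. It is an untested sketch, like the description itself: recover_dump and DRY_RUN are made-up names for illustration, and /dev/ad0s1b is only assumed as the dump partition because it is the one used for dumpon in this thread.

```shell
# recover_dump [DUMPDEV]
# Sketch of the single-user sequence: check the file systems without
# enabling swap, mount them, then save the dump.
# "recover_dump" and DRY_RUN are hypothetical names; set DRY_RUN=1 to
# print the commands instead of running them.
recover_dump() {
    dumpdev=${1:-/dev/ad0s1b}    # assumed dump partition
    run=${DRY_RUN:+echo}
    $run fsck -p &&              # swapon is deliberately not run first
    $run mount -a -t ufs &&      # makes /var/crash available
    $run savecore /var/crash "$dumpdev"
}
```

Running it with DRY_RUN=1 first shows the three commands without touching the disks; after a real run, exiting the single-user shell continues to multi-user.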

Michal



Re: Page fault, GEOM problem??

2005-11-18 Thread Johan Ström


On 18 nov 2005, at 23.39, Michal Mertl wrote:


Johan Ström wrote:

Hi!

On 18 nov 2005, at 18.43, Xin LI wrote:


Hi, Johan,


 large snip


So, it seems it does run savecore after running dumpon and mounting
disks etc... Is that wrong?


No, this is normal. When you run savecore you need to have mounted
filesystems. In order to mount the filesystems they may have to be
checked. The fsck program requires big amount of memory to check larger
filesystems so the swap has to be enabled. Core dumps are written to the
dump device (swap) from the end whereas the swap is normally used from
the beginning (or the other way around). Therefore there's quite a big
chance that, even when the swap has to be used for fsck, the core dump
is intact and usable. If the usage of the swap file by fsck corrupts the
core dump you may start after next crash in single user mode and run the
commands manually (without enabling swap).

As to why you can write kernel core dumps only to certain devices the
answer is that at the time, when the kernel is dumping core, it is
usually in pretty bad state, kernel internals may be corrupted and so
on. The dumping code is therefore written to be quite low level so that
even wedged kernel can be dumped. The dumping code is part of hard disk
controller's drivers. The gmirror is quite high-level device and geom
itself needs working scheduler so there will probably never be a way to
dump on gmirror provided swap. When you issue the dumpon command the
check is performed whether the driver for the disk you want to dump on
supports kernel core dumps.

Michal


Well that makes sense... Then that is right at least.. :)

I just noticed another thing... My disk performance... sucks! :P

Some examples (from an otherwise unloaded system):

[EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=1024 count=1000000
1000000+0 records in
1000000+0 records out
1024000000 bytes transferred in 77.014797 secs (13296146 bytes/sec)

real    1m17.100s
user    0m0.244s
sys     0m10.140s

13MB/s from /dev/zero?? This was to my home dir (gm0s1f, the last label on the slice/disk).
When I open a new window in screen (ctrl-a c), it takes forever (or rather, bash takes forever) to start while the above dd is running...

Well, iostat during dd:

[EMAIL PROTECTED]:~$ iostat
      tty             ad0              ad6             ad10             cpu
 tin tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0  164  2.19   0  0.00  50.52   3  0.17  50.99   3  0.17   1  0  1  1 97



0.17MB/s?? Am I misreading these iostat numbers or something?
The load averages directly after the dd completes are 0.36, 0.15, 0.05, so the dd doesn't generate enough load to make bash that slow... It's got to be something else...



Running diskinfo -t gives me good values (for /dev/ad6 and /dev/ad10)

Transfer rates:
        outside:       102400 kbytes in   1.846578 sec =    55454 kbytes/sec
        middle:        102400 kbytes in   1.879855 sec =    54472 kbytes/sec
        inside:        102400 kbytes in   3.147158 sec =    32537 kbytes/sec


So it shouldn't be the disk itself; those values are the same as when I had the disks in the temp system. However, I never tried any dd speed tests there.
Btw, I tried a regular cp on a directory tree of a few gigs, same slooow speed..


Maybe my custom kernel is fucked up or something? It's just a GENERIC with some unused device drivers removed, so that would be strange...

I'll recompile during the night and test GENERIC tomorrow, reporting back..

I did try moving the cards (network/vga/sata) around in the PCI slots, in case there were any strange conflicts... No difference, except I only got one tx error from xl since the last boot (wooh!)


No crash so far.

--
Johan



Re: Page fault, GEOM problem??

2005-11-18 Thread Pawel Jakub Dawidek
On Sat, Nov 19, 2005 at 01:55:57AM +0100, Johan Ström wrote:
+ 
+ On 18 nov 2005, at 23.39, Michal Mertl wrote:
+ 
+ Johan Ström wrote:
+ Hi!
+ 
+ On 18 nov 2005, at 18.43, Xin LI wrote:
+ 
+ Hi, Johan,
+ 
+  large snip
+ 
+ So, it seems it does run savecore after running dumpon and mounting
+ disks etc... Is that wrong?
+ 
+ No, this is normal. When you run savecore you need to have mounted
+ filesystems. In order to mount the filesystems they may have to be
+ checked. The fsck program requires big amount of memory to check larger
+ filesystems so the swap has to be enabled. Core dumps are written to the
+ dump device (swap) from the end whereas the swap is normally used from
+ the beginning (or the other way around). Therefore there's quite a big
+ chance that, even when the swap has to be used for fsck, the core dump
+ is intact and usable. If the usage of the swap file by fsck corrupts the
+ core dump you may start after next crash in single user mode and run the
+ commands manually (without enabling swap).
+ 
+ As to why you can write kernel core dumps only to certain devices the
+ answer is that at the time, when the kernel is dumping core, it is
+ usually in pretty bad state, kernel internals may be corrupted and so
+ on. The dumping code is therefore written to be quite low level so that
+ even wedged kernel can be dumped. The dumping code is part of hard disk
+ controller's drivers. The gmirror is quite high-level device and geom
+ itself needs working scheduler so there will probably never be a way to
+ dump on gmirror provided swap. When you issue the dumpon command the
+ check is performed whether the driver for the disk you want to dump on
+ supports kernel core dumps.
+ 
+ Michal
+ 
+ Well that makes sense... Then that is right at least.. :)
+ 
+ I just noticed another thing... My disk performance... sucks! :P
+ 
+ Some examples (from an otherwise unloaded system):
+ 
+ [EMAIL PROTECTED]:/home/johan$ time dd if=/dev/zero of=bigfile.zero bs=1024 count=1000000
+ 1000000+0 records in
+ 1000000+0 records out
+ 1024000000 bytes transferred in 77.014797 secs (13296146 bytes/sec)

You won't get more with such small block size. Try bs=128k.

-- 
Pawel Jakub Dawidek   http://www.wheel.pl
[EMAIL PROTECTED]   http://www.FreeBSD.org
FreeBSD committer Am I Evil? Yes, I Am!




Page fault, GEOM problem??

2005-11-17 Thread Johan Ström
Ok, just got this not so very nice error on a RELENG_6_0 box (built from sources this morning, GENERIC kernel minus drivers I don't use):


Nov 17 15:35:43 elfi kernel: subdisk10: detached
Nov 17 15:35:43 elfi kernel: ad10: detached
Nov 17 15:35:43 elfi kernel: unknown: TIMEOUT - READ_DMA retrying (1 retry left) LBA=85720528
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Device gm0s1: provider ad10s1 disconnected.
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134356992, length=16384)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134373376, length=16384)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=134438912, length=16384)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268591104, length=16384)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268607488, length=16384)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268623872, length=16384)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=268640256, length=16384)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=20151026176, length=2048)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[WRITE(offset=32299655680, length=8192)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[READ(offset=37363671552, length=16384)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[READ(offset=38349087232, length=16384)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[READ(offset=45453566464, length=16384)]
Nov 17 15:35:43 elfi kernel: GEOM_MIRROR: Request failed (error=6). ad10s1[READ(offset=54459458048, length=131072)]

Nov 17 17:59:18 elfi syslogd: kernel boot file is /boot/kernel/kernel
Nov 17 17:59:18 elfi kernel:
Nov 17 17:59:18 elfi kernel:
Nov 17 17:59:18 elfi kernel: Fatal trap 12: page fault while in kernel mode
Nov 17 17:59:18 elfi kernel: fault virtual address  = 0x48
Nov 17 17:59:18 elfi kernel: fault code             = supervisor read, page not present
Nov 17 17:59:18 elfi kernel: instruction pointer    = 0x20:0xc0506b92
Nov 17 17:59:18 elfi kernel: stack pointer          = 0x28:0xd56d7c9c
Nov 17 17:59:18 elfi kernel: frame pointer          = 0x28:0xd56d7c9c
Nov 17 17:59:18 elfi kernel: code segment           = base 0x0, limit 0xf, type 0x1b
Nov 17 17:59:18 elfi kernel:                        = DPL 0, pres 1, def32 1, gran 1
Nov 17 17:59:18 elfi kernel: processor eflags       = interrupt enabled, resume, IOPL = 0
Nov 17 17:59:18 elfi kernel: current process        = 36 (swi4: clock sio)
Nov 17 17:59:18 elfi kernel: trap number            = 12
Nov 17 17:59:18 elfi kernel: panic: page fault
Nov 17 17:59:18 elfi kernel: Uptime: 8h55m1s
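
To see at a glance how many requests the mirror rejected, the GEOM_MIRROR lines can be tallied per provider and operation. The helper below is only a sketch: summarize_geom_failures is a made-up name, and it assumes each log entry sits on a single line, as syslog writes it to /var/log/messages. For reference, error=6 is ENXIO ("Device not configured") in errno(2), which fits ad10 having been detached just before.

```shell
# summarize_geom_failures
# Reads syslog lines on stdin and prints "<provider> <READ|WRITE> <count>"
# for every GEOM_MIRROR "Request failed" entry.
# "summarize_geom_failures" is a hypothetical helper name.
summarize_geom_failures() {
    awk '/GEOM_MIRROR: Request failed/ {
        # Pick out e.g. "ad10s1[WRITE" and turn it into "ad10s1 WRITE"
        if (match($0, /[a-z0-9]+\[(READ|WRITE)/)) {
            key = substr($0, RSTART, RLENGTH)
            sub(/\[/, " ", key)
            n[key]++
        }
    }
    END { for (k in n) print k, n[k] }' | sort
}
```

A typical call would be `summarize_geom_failures < /var/log/messages`.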

ad10 and ad6, two brand new Maxtor Maxline 300GB SATA disks attached to a Promise PDC40518 SATA150 controller, make up a GEOM mirror, gm0s1.
I've been running this setup in another test machine (MSI K8N Neo Platinum, KT333 chip I believe), and I haven't had a single problem. I moved the disks/controller card to my real server 24 hours ago, and the only apparent problem I seemed to have was this:


Nov 17 07:06:12 elfi kernel: xl0: transmission error: 90
Nov 17 07:06:12 elfi kernel: xl0: tx underrun, increasing tx start threshold to 120 bytes
Nov 17 07:06:18 elfi kernel: xl0: watchdog timeout
Nov 17 07:06:18 elfi kernel: xl0: link state changed to DOWN
Nov 17 07:06:18 elfi kernel: vlan5: link state changed to DOWN
Nov 17 07:06:20 elfi kernel: xl0: link state changed to UP
Nov 17 07:06:20 elfi kernel: vlan5: link state changed to UP

Coming and going... these problems only appeared during the first 20-30 minutes after boot, then they disappeared totally (and yes, there was plenty of network IO going on both during and after these messages). Sometimes I just got the first two messages and nothing happened, but sometimes the watchdog message came and the network died for a minute or so.


Here is dmesg from last boot (directly after crash):

Copyright (c) 1992-2005 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD 6.0-RELEASE #0: Thu Nov 17 00:49:29 CET 2005
    [EMAIL PROTECTED]:/usr/obj/usr/src/sys/ELFI
ACPI APIC Table: <ASUS   A7V333  >
Timecounter i8254 frequency 1193182 Hz quality 0
CPU: AMD Athlon(TM) XP 1900+ (1599.56-MHz 686-class CPU)
  Origin = AuthenticAMD  Id = 0x662  Stepping = 2
  Features=0x383fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE>
  AMD Features=0xc0480800<SYSCALL,MP,MMX+,3DNow+,3DNow>
real memory  = 536854528 (511 MB)
avail memory = 516014080 (492 MB)
ioapic0: