Re: SIGEPIPE after update to 8.1-RC2

2010-07-20 Thread Ruben van Staveren
Hi,

On 18 Jul 2010, at 4:20, Sean wrote:

 On 18/07/2010 1:24 AM, Alex Kozlov wrote:
 Hi, stable
 
 After updating my buildbox from 26 April 8-STABLE
 to 8.1-RC2 I constantly getting SIGEPIPE
 
 
 
 [snip]
 
 I'm getting the same thing; what shell are you using? I changed my shell on 
 one machine from /bin/tcsh to /usr/local/bin/bash and problem disappeared.

Another occasion where this problem acts up:

is marked as broken: does not build
** Makefile possibly broken: mail/moztraybiff:
grep: write error: Broken pipe
moztraybiff-1.2.4_1
---  Session ended at: Tue, 20 Jul 2010 09:04:41 +0200 (consumed 00:03:01)
/usr/local/sbin/portupgrade:1473:in `get_pkgname': Makefile broken (MakefileBrokenError)
from /usr/local/sbin/portupgrade:623
from /usr/local/sbin/portupgrade:614:in `each'
from /usr/local/sbin/portupgrade:614
from /usr/local/sbin/portupgrade:588:in `catch'
from /usr/local/sbin/portupgrade:588
from /usr/local/lib/ruby/1.8/optparse.rb:1310:in `call'
from /usr/local/lib/ruby/1.8/optparse.rb:1310:in `parse_in_order'
from /usr/local/lib/ruby/1.8/optparse.rb:1306:in `catch'
from /usr/local/lib/ruby/1.8/optparse.rb:1306:in `parse_in_order'
from /usr/local/lib/ruby/1.8/optparse.rb:1254:in `catch'
from /usr/local/lib/ruby/1.8/optparse.rb:1254:in `parse_in_order'
from /usr/local/lib/ruby/1.8/optparse.rb:1248:in `order!'
from /usr/local/lib/ruby/1.8/optparse.rb:1241:in `order'
from /usr/local/sbin/portupgrade:565:in `main'
from /usr/local/lib/ruby/1.8/optparse.rb:791:in `initialize'
from /usr/local/sbin/portupgrade:229:in `new'
from /usr/local/sbin/portupgrade:229:in `main'
from /usr/local/sbin/portupgrade:2213

This happens during a sudo portupgrade -va --batch;
my shell is /bin/tcsh too. When I run exec bash after sudo -s and then run 
portupgrade, the problem doesn't show up. 

To me, this is a clear breakage and should be considered a show-stopper issue 
for 8.1-RELEASE. All shells should be equally supported, especially when they 
reside in /bin. Is there already an open PR on this?

Thanks,
Ruben  


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: SIGEPIPE after update to 8.1-RC2

2010-07-20 Thread Jeremy Chadwick
On Tue, Jul 20, 2010 at 09:19:39AM +0200, Ruben van Staveren wrote:
 To me, this is a clear breakage and should be considered a show
 stopper issue for 8.1-RELEASE.

Too late for that now...

ftp://ftp4.freebsd.org/pub/FreeBSD/releases/amd64/
ftp://ftp4.freebsd.org/pub/FreeBSD/releases/i386/

-- 
| Jeremy Chadwick   j...@parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |



Re: 8.1-RC2 MCE caused by some LAPIC/clock changes? (was: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?)

2010-07-20 Thread jhell


On Sat, 17 Jul 2010 14:35, Markus Gebert wrote:
In Message-Id: f744f475-3d2b-4bc6-856a-a5d302aa8...@hostpoint.ch



On 13.07.2010, at 16:02, Markus Gebert wrote:

Unfortunately, I have not been able to get anything useful out of the svn 
commit logs that could explain this. Maybe someone else has an idea 
what could have changed between 7 and 8 to break it, and again between 
8 and CURRENT to magically fix it again.


I tracked this down further. I couldn't easily downgrade my 8.1 
installation to see when the problem was introduced because the zpool 
version used is 14. So I tried to figure out, when the problem was 
solved in CURRENT.


I started with the first possible revision that can boot off my v14 pool 
(r201143, Dec 28, zfs v14 commit). With this revision, I was able to 
trigger the MCE.


Then I took some later revision (rev206010, Apr 1, chosen randomly), and 
I couldn't reproduce the problem. I started narrowing the revisions down 
until I found out, that while on r202386 I'm still able to trigger the 
MCE, r202387 seems to solve the problem on CURRENT:


http://svn.freebsd.org/viewvc/base?view=revision&revision=202387

Since John Baldwin mentioned this problem could be timing related, it 
seems reasonable that a clock-related change could fix it. But this 
commit seems to have been MFC'd to 8-STABLE and 8.1 (at least as far as 
I can tell) along with some other changes to amd64 specific code. I 
thought that maybe these other changes that have been MFC'd could have 
reintroduced the problem later on, but so far I could not reproduce the 
problem with newer CURRENT revisions. So, I actually nailed this one 
down to a single commit on CURRENT, but still cannot tell what the 
actual difference is compared to 8-STABLE/8.1.
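
The narrowing-down described above is a bisection over revision numbers. A minimal Python sketch of that procedure follows; it is an illustration, not something used in this thread. The function name is made up, and the `reproduces` callback is a hypothetical stand-in for "build and boot this revision, then try to trigger the MCE". It assumes the behaviour changes exactly once inside the range.

```python
def first_fixed_revision(bad, good, reproduces):
    """Bisect between a revision that shows the problem and one that does not.

    bad        -- a revision where the problem is known to reproduce
    good       -- a later revision where it is known not to reproduce
    reproduces -- callable(revision) -> bool (hypothetical test procedure)

    Returns the first revision at which the problem no longer reproduces,
    assuming a single transition from "reproduces" to "fixed" in the range.
    """
    while good - bad > 1:
        mid = (bad + good) // 2
        if reproduces(mid):
            bad = mid   # problem still present: the fix is later
        else:
            good = mid  # problem gone: the fix is here or earlier
    return good


if __name__ == "__main__":
    # Stand-in callback built from the result reported above
    # (r202386 reproduces, r202387 does not).
    print(first_fixed_revision(201143, 206010, lambda rev: rev < 202387))
```

Between r201143 and r206010 this needs at most 13 build-and-test cycles, since each cycle halves a range of 4867 revisions.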


Any ideas how to proceed?



Adding to this, I remembered some specific commits that caught my attention 
when they happened. Specifically, they were to mca.c: (locate mca) on my 
machine provided the file paths, and svn log provided the commit log.


When you said April and I saw the log, it rang a bell.

These may be of interest to you:


r210079 | jhb | 2010-07-14 17:10:14 -0400 (Wed, 14 Jul 2010) | 13 lines

MFC 208507,208556,208621:

Add support for corrected machine check interrupts.  CMCI is a new local 
APIC interrupt that fires when a threshold of corrected machine check 
events is reached.  CMCI also includes a count of events when reporting 
corrected errors in the bank's status register.  Note that individual 
banks may or may not support CMCI.  If they do, each bank includes its own 
threshold register that determines when the interrupt fires.  Currently 
the code uses a very simple strategy where it doubles the threshold on 
each interrupt until it succeeds in throttling the interrupt to occur only 
once a minute (this interval can be tuned via sysctl).  The threshold is 
also adjusted on each hourly poll which will lower the threshold once 
events stop occurring.
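
The throttling strategy in this commit message can be modelled in a few lines. The Python sketch below is an illustration only, not the code in mca.c: the function names, the halving step on the hourly poll and the 15-bit cap on the threshold are assumptions. The log itself only says that the threshold doubles on each interrupt until interrupts arrive no more than once per interval, and that the hourly poll lowers it once events stop.

```python
def on_cmci(threshold, seconds_since_last, min_interval=60, max_threshold=0x7FFF):
    """Return the threshold to use after a CMCI has fired.

    If the interrupt came sooner than min_interval seconds after the
    previous one, double the threshold (capped at max_threshold, an
    assumed limit); otherwise leave it alone.
    """
    if seconds_since_last < min_interval:
        return min(threshold * 2, max_threshold)
    return threshold


def on_hourly_poll(threshold, events_seen):
    """Return the threshold to use after the hourly poll.

    Assumption: when no corrected events were seen since the last poll,
    halve the threshold, never going below 1.
    """
    if events_seen == 0:
        return max(threshold // 2, 1)
    return threshold


if __name__ == "__main__":
    t = 1
    # A burst of interrupts 5 seconds apart keeps doubling the threshold.
    for _ in range(4):
        t = on_cmci(t, seconds_since_last=5)
    print(t)
    # A quiet hour lowers it again.
    print(on_hourly_poll(t, events_seen=0))
```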



r206183 | alc | 2010-04-05 12:11:42 -0400 (Mon, 05 Apr 2010) | 6 lines

MFC r204907, r204913, r205402, r205573, r205573
  Implement AMD's recommended workaround for Erratum 383 on Family 10h
  processors.

  Enable machine check exceptions by default.



And a list of mca.c's within the stable/8 src tree:
/usr/src/sbin/mca/mca.c
/usr/src/sys/amd64/amd64/mca.c
/usr/src/sys/dev/aha/aha_mca.c
/usr/src/sys/dev/buslogic/bt_mca.c
/usr/src/sys/dev/ep/if_ep_mca.c
/usr/src/sys/i386/i386/mca.c
/usr/src/sys/ia64/ia64/mca.c


Regards & Good luck,

--

 jhell



Re: SIGEPIPE after update to 8.1-RC2

2010-07-20 Thread Sean

On 20/07/2010, at 5:19 PM, Ruben van Staveren wrote:

 Hi,
 
 This happens during a sudo portupgrade -va --batch
 my shell is /bin/tcsh too. When I run exec bash after sudo -s and then do 
 the portupgrade the problem doesn't show up. 
 
 To me, this is a clear breakage and should be considered a show stopper issue 
 for 8.1-RELEASE. All shells should be equally supported, especially when they 
 reside in /bin. Is there already an open pr on this ?
 

No PR from me, and not a chance of a fix to 8.1 at this point, unless it really 
does cause breakage (not just a message, but actually stops things); the tag 
has been laid down and would need to be slid forward.

It's likely to be either of two things... a bug in sh, that using tcsh 
highlights because of differing signal setup; or a bug in tcsh that a bug fix 
in sh highlights. It's a bug that comes and goes in the history of FreeBSD, at 
least since early 2006 (based on 10 seconds with Google - 
http://www.linuxquestions.org/questions/*bsd-17/broken-pipe-432167/)
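
The "differing signal setup" hypothesis can be illustrated in isolation. A writer only sees a write fail with EPIPE (which grep reports as "write error: Broken pipe") when SIGPIPE is ignored, blocked or handled; with the default disposition the writer is killed by the signal instead and prints nothing. An ignored disposition is inherited across fork and exec, so a parent that ignores SIGPIPE changes what its children report. The Python sketch below only demonstrates that mechanism on a POSIX system. It is not taken from this thread, the function name is made up, and it does not show which program is at fault here.

```python
import errno
import os
import signal


def write_to_closed_pipe():
    """Write to a pipe whose read end is closed, with SIGPIPE ignored.

    Returns the errno of the failed write (0 if the write succeeded).
    """
    previous = signal.signal(signal.SIGPIPE, signal.SIG_IGN)
    read_end, write_end = os.pipe()
    os.close(read_end)  # nobody will ever read from this pipe
    try:
        os.write(write_end, b"data\n")
    except OSError as exc:
        # With SIGPIPE ignored the kernel reports the failure as EPIPE.
        return exc.errno
    finally:
        os.close(write_end)
        if previous is not None:
            signal.signal(signal.SIGPIPE, previous)
    return 0


if __name__ == "__main__":
    code = write_to_closed_pipe()
    print(errno.errorcode.get(code, code), os.strerror(code))
```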


 Thanks,
   Ruben  
 
 



Re: Problems replacing failing drive in ZFS pool

2010-07-20 Thread Andriy Gapon
on 20/07/2010 01:04 Garrett Moore said the following:
 Well, hotswapping worked, but now I have a totally different problem. Just
 for reference:
 # zpool offline tank da3
 # camcontrol stop da3
 swap drive
 # camcontrol rescan all
 'da3 lost device, removing device entry'
 # camcontrol rescan all
 'da3 at mpt0 ...', so new drive was found! yay
 # zpool replace tank da3
 *cannot replace da3 with da3: device is too small*
 
 So I looked at the smartctl output for the old and new drive. Old:
 Device Model: WDC WD15EADS-00P8B0
 Serial Number:WD-WMAVU0087717
 Firmware Version: 01.00A01
 User Capacity:1,500,301,910,016 bytes
 
 New:
 Device Model: WDC WD15EADS-00R6B0
 Serial Number:WD-WCAVY4770428
 Firmware Version: 01.00A01
 User Capacity:1,500,300,828,160 bytes
 
 God damnit, Western Digital. What can I do now? It's such a small
 difference, is there a way I can work around this? My other replacement
 drive is the 00R6B0 drive model as well, with the slightly smaller
 capacity.

I second what others have said - crap.
But there could be some hope, not sure.
Can you check what is the actual size used by the pool on the disk?
It should be somewhere in zdb -C output (asize?).
If I remember correctly, that actual size should be a multiple of some rather
large power of two, so it could be that it is smaller than 'User Capacity' of
both old and new drives.
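
To put numbers on this, the Python sketch below computes the gap between the two 'User Capacity' values quoted above and shows how rounding down to a power-of-two multiple can hide such a gap. The granularity passed to round_down() is purely hypothetical: the thread does not state the value ZFS actually uses, and space reserved for labels is ignored here. The real figure has to come from the zdb -C output.

```python
OLD_CAPACITY = 1_500_301_910_016  # WD15EADS-00P8B0, bytes (from smartctl)
NEW_CAPACITY = 1_500_300_828_160  # WD15EADS-00R6B0, bytes (from smartctl)
SECTOR = 512                      # assumed sector size in bytes


def capacity_gap(old, new, sector=SECTOR):
    """Return the size difference as (bytes, whole sectors)."""
    diff = old - new
    return diff, diff // sector


def round_down(size, granularity):
    """Round size down to a multiple of granularity (hypothetical value)."""
    return size - (size % granularity)


if __name__ == "__main__":
    print(capacity_gap(OLD_CAPACITY, NEW_CAPACITY))
    # With a hypothetical 1 GiB granularity both drives round to the same size.
    print(round_down(OLD_CAPACITY, 2**30), round_down(NEW_CAPACITY, 2**30))
```

The new drive is 1,081,856 bytes (2113 sectors of 512 bytes) smaller. If the size recorded for the pool member were rounded down with a coarse enough granularity, both drives could still hold it; that the replace was refused suggests the recorded size should be checked before relying on this.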

-- 
Andriy Gapon


Re: SIGEPIPE after update to 8.1-RC2

2010-07-20 Thread Ruben van Staveren

On 20 Jul 2010, at 10:03, Jeremy Chadwick wrote:

 On Tue, Jul 20, 2010 at 09:19:39AM +0200, Ruben van Staveren wrote:
 To me, this is a clear breakage and should be considered a show
 stopper issue for 8.1-RELEASE.
 
 Too late for that now…

Oh well, errata when the culprit is found…

I've filed this as misc/148781

 
 ftp://ftp4.freebsd.org/pub/FreeBSD/releases/amd64/
 ftp://ftp4.freebsd.org/pub/FreeBSD/releases/i386/

Thanks!

 
 -- 
 | Jeremy Chadwick   j...@parodius.com |
 | Parodius Networking   http://www.parodius.com/ |
 | UNIX Systems Administrator  Mountain View, CA, USA |
 | Making life hard for others since 1977.  PGP: 4BD6C0CB |

Regards,
Ruben


Re: Problems replacing failing drive in ZFS pool

2010-07-20 Thread Pawel Tyll
Hi guys,

 I second what others have said - crap.
 But there could be some hope, not sure.
 Can you check what is the actual size used by the pool on the disk?
 It should be somewhere in zdb -C output (asize?).
 If I remember correctly, that actual size should be a multiple of some rather
 large power of two, so it could be that it is smaller than 'User Capacity' of 
 both
 old and new drives.
Well, I see some possibilities for a creative solution here, using some
ssd (or usb stick or mdconfig as an act of desperation) and gconcat, but
it's asking for trouble and should probably be considered a temporary
hack.

What I personally would do is get a 2TB drive and use it instead, with
gpt and -l for label, and replace it as gpt/something. Using 100 or so
MB less than the whole disk is also a good idea, as you can see ;)
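
The effect of leaving some slack can be checked against the two capacities quoted earlier in this thread. The Python sketch below is an illustration under stated assumptions: the function name, the 100 MiB slack and the 1 MiB alignment are illustrative choices, and the small amount of space taken by the GPT itself is ignored.

```python
MIB = 1024 * 1024

OLD_CAPACITY = 1_500_301_910_016  # larger drive, bytes (from smartctl)
NEW_CAPACITY = 1_500_300_828_160  # smaller replacement drive, bytes


def partition_bytes(capacity, slack=100 * MIB, align=MIB):
    """Size for a data partition that leaves slack bytes unused.

    The result is rounded down to a multiple of align. Both defaults
    are illustrative assumptions, not values required by ZFS or GPT.
    """
    usable = capacity - slack
    return usable - (usable % align)


if __name__ == "__main__":
    size = partition_bytes(OLD_CAPACITY)
    print(size, size < NEW_CAPACITY)
```

A partition sized this way on the larger drive is still smaller than the whole of the slightly smaller replacement model, so a same-sized partition fits on either drive.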

Cheers and good luck.




today's 8.1/i386: panic: bad pte

2010-07-20 Thread Mikhail T.
Some part of KDE4's kdm crashed at start-up and seems to have taken the 
entire machine with it:


   kgdb /boot/kernel/kernel /var/crash/vmcore.22
   GNU gdb 6.1.1 [FreeBSD]
   Copyright 2004 Free Software Foundation, Inc.
   GDB is free software, covered by the GNU General Public License, and
   you are
   welcome to change it and/or distribute copies of it under certain
   conditions.
   Type "show copying" to see the conditions.
   There is absolutely no warranty for GDB.  Type "show warranty" for
   details.
   This GDB was configured as i386-marcel-freebsd...

   Unread portion of the kernel message buffer:
   <6>pid 18398 (drkonqi), uid 0: exited on signal 11 (core dumped)
   TPTE at 0xbfca9488  IS ZERO @ VA 2a522000
   panic: bad pte
   Uptime: 2h28m24s
   Physical memory: 1263 MB
   Dumping 195 MB: 180 164 148 132 116 100 84 68 52 36 20 4

   Reading symbols from /boot/kernel/splash_pcx.ko...Reading symbols
   from /boot/kernel/splash_pcx.ko.symbols...done.
   done.
   Loaded symbols for /boot/kernel/splash_pcx.ko
   Reading symbols from /boot/kernel/vesa.ko...Reading symbols from
   /boot/kernel/vesa.ko.symbols...done.
   done.
   Loaded symbols for /boot/kernel/vesa.ko
   Reading symbols from /boot/modules/nvidia.ko...done.
   Loaded symbols for /boot/modules/nvidia.ko
   Reading symbols from /boot/kernel/linux.ko...Reading symbols from
   /boot/kernel/linux.ko.symbols...done.
   done.
   Loaded symbols for /boot/kernel/linux.ko
   Reading symbols from /boot/kernel/acpi.ko...Reading symbols from
   /boot/kernel/acpi.ko.symbols...done.
   done.
   Loaded symbols for /boot/kernel/acpi.ko
   Reading symbols from /boot/kernel/linprocfs.ko...Reading symbols
   from /boot/kernel/linprocfs.ko.symbols...done.
   done.
   Loaded symbols for /boot/kernel/linprocfs.ko
   #0  doadump () at pcpu.h:231
   231 __asm __volatile("movl %%fs:0,%0" : "=r" (td));
   (kgdb) bt full
   #0  doadump () at pcpu.h:231
   No locals.
   #1  0xc05d10a4 in boot (howto=260) at
   /usr/src/sys/kern/kern_shutdown.c:416
_giantcnt = Variable _giantcnt is not available.
   (kgdb) where
   #0  doadump () at pcpu.h:231
   #1  0xc05d10a4 in boot (howto=260) at
   /usr/src/sys/kern/kern_shutdown.c:416
   #2  0xc05d12b1 in panic (fmt=Variable fmt is not available.
   ) at /usr/src/sys/kern/kern_shutdown.c:590
   #3  0xc07f0406 in pmap_remove_pages (pmap=0xc85bbc78) at
   /usr/src/sys/i386/i386/pmap.c:4198
   #4  0xc079516b in vmspace_exit (td=0xc51f3a00) at
   /usr/src/sys/vm/vm_map.c:409
   #5  0xc05a7253 in exit1 (td=0xc51f3a00, rv=139) at
   /usr/src/sys/kern/kern_exit.c:303
   #6  0xc05d3296 in sigexit (td=0xc51f3a00, sig=139) at
   /usr/src/sys/kern/kern_sig.c:2872
   #7  0xc05d47a8 in postsig (sig=11) at /usr/src/sys/kern/kern_sig.c:2759
   #8  0xc06082f8 in ast (framep=0xe5fafd38) at
   /usr/src/sys/kern/subr_trap.c:234
   #9  0xc07e2c44 in doreti_ast () at
   /usr/src/sys/i386/i386/exception.s:368

Does this look familiar to anyone? Thanks!

   -mi



Re: panic: handle_written_inodeblock: bad size

2010-07-20 Thread Jeremy Chadwick
On Mon, Jul 19, 2010 at 01:41:24PM -0700, Jeremy Chadwick wrote:
 On Mon, Jul 19, 2010 at 11:55:59AM -0400, Mikhail T. wrote:
  19.07.2010 07:31, Jeremy Chadwick wrote:
  If you boot the machine in single-user, and run fsck manually, are there
  any errors?
  Thanks, Jeremy... I wish, there was a way to learn, /which/
  file-system is giving trouble... However, after sending the question
  out last night, I tried to pkg_delete a package on the machine, and
  was very lucky to see a file-system error (inode something or other)
  before the panic struck. That, at least, told me, which file-system
  was in trouble (/var).
  [...]
  And, IMO, at the very least, *any panic related to a file-system
  must clearly identify the file-system in question*... What do you
  think?

 [...] 
 Assuming work tonight isn't that busy for me, I'll see if I can dedicate
 some cycles to printing this information in the error string you saw.

I spent some time on this tonight.  It's not as simple as it sounds, for
me anyway.  Relevant source bits:

src/sys/ufs/ffs/ffs_softdep.c
src/sys/ufs/ffs/fs.h
src/sys/ufs/ffs/softdep.h

ffs_softdep.c, which is almost 6500 lines, contains a large number of
inode-related functions which can call panic().  Functions which have
easy access to the related inodedep struct are the ones which would be
able to print this information easily.  Sort of.

struct inodedep (see softdep.h) contains a member called id_fs, which is
struct fs (see fs.h).  struct fs contains a member called fs_fsmnt (a
char buffer), which is the name of the mounted filesystem.  fs_fsmnt[0]
should be NUL ('\0') if the filesystem isn't mounted.

So in the case of your panic within handle_written_inodeblock(), it
would be as simple as something like:

u_char *mntpt = NULL;

if (inodedep->id_fs->fs_fsmnt[0] != '\0')
	mntpt = inodedep->id_fs->fs_fsmnt;
else
	;	/* XXX do what here? */

Then, the panic() statements later have to do something like this (taken
from real code):

if (dp1->di_db[adp->ad_lbn] != adp->ad_oldblkno)
	panic("%s: %s: %s #%jd mismatch %d != %jd",
	    "handle_written_inodeblock",
	    (mntpt ? (const char *)mntpt : "unknown"),
	    "direct pointer",
	    (intmax_t)adp->ad_lbn,
	    dp1->di_db[adp->ad_lbn],
	    (intmax_t)adp->ad_oldblkno);

The panic message would look like one of the following:

panic: handle_written_inodeblock: /mnt: direct pointer #nnn mismatch nnn != nnn
panic: handle_written_inodeblock: unknown: direct pointer #nnn mismatch nnn != nnn

The "unknown" string there is a Bad Idea(tm); see below.

Secondly, this brings up the question: what happens if someone is doing
something like fsck /var, where /var uses soft updates?  /var isn't
mounted when this happens.  Can these inode-related functions get called
during that time?  If so, fs_fsmnt would (in theory -- I haven't tested
in practise) be empty.  So in that case, what should get printed as the
filesystem?  Well, this is where the "unknown" string comes into play.

My first answer was: the name of the device/slice/etc. which the inode
is associated with.

The problem is that I couldn't find a way to get this information, as
it's not stored in struct fs anywhere.  One would have to pass it down
the stack, which changes the kernel ABI and is not something I'm willing
to do (plus there are performance implications, as you're passing
something else on the stack with every call).  Of course there may be a
way to get this easily, but I don't see it or know of it.

Thirdly, and this is equally important: given the repetitive nature
of this code (it would have to be repeated in numerous functions),
making a common function that populates a (global) variable with the
fsname it's working on would be ideal.  But I don't know the implications
of this, nor do I see many (I think two?) global variables used within
ffs_softdep.c.

Extending one of the structs to get access to the necessary information
is not as simple as "just do it" -- there are implications when it comes
to memory usage and so on.  This is not a piece of code to bang on
lightly.

This should probably be discussed on freebsd-hackers, but cross-posting
across 3 separate mailing lists is rude.  If you want to drive this,
cool, but please start a new thread about the matter (wanting the
filesystem or device printed in panic() when things like filesystem
panics happen) on freebsd-hackers.  I'm not subscribed to that list, so
please CC me if you go this route.

-- 
| Jeremy Chadwick   j...@parodius.com |
| Parodius Networking   http://www.parodius.com/ |
| UNIX Systems Administrator  Mountain View, CA, USA |
| Making life hard for others since 1977.  PGP: 4BD6C0CB |


Re: today's 8.1/i386: panic: bad pte

2010-07-20 Thread Alan Cox
On Mon, Jul 19, 2010 at 11:40 PM, Mikhail T. mi+t...@aldan.algebra.com wrote:

 Some part of KDE4's kdm crashed at start-up and seems to have taken the
 entire machine with it:

 [snip]

 Does this look familiar to anyone? Thanks!


Historically, this panic has indicated flakey memory.  This panic occurs
because a memory location within a page table has unexpectedly changed to
zero.

Alan


Re: today's 8.1/i386: panic: bad pte

2010-07-20 Thread Alan Cox

Mikhail T. wrote:

20.07.2010 12:47, Alan Cox wrote:
Historically, this panic has indicated flakey memory.  This panic 
occurs because a memory location within a page table has unexpectedly 
changed to zero.
Ouch... Thanks for the hint (maybe, the panic should say something 
like that?)


In any case, is there a way to identify the flakey DIMM? I did run 
memtest on this box and haven't received any errors... Thanks! Yours,


No, not from the panic message.  If a thorough memtest didn't turn up a 
problem, then I would start looking for another cause.


Alan



Re: today's 8.1/i386: panic: bad pte

2010-07-20 Thread Mikhail T.

20.07.2010 12:47, Alan Cox wrote:
Historically, this panic has indicated flakey memory.  This panic 
occurs because a memory location within a page table has unexpectedly 
changed to zero.
Ouch... Thanks for the hint (maybe, the panic should say something like 
that?)


In any case, is there a way to identify the flakey DIMM? I did run 
memtest on this box and haven't received any errors... Thanks! Yours,


   -mi



Re: 8.1-RC2 MCE caused by some LAPIC/clock changes? (was: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?)

2010-07-20 Thread John Baldwin
On Saturday, July 17, 2010 2:35:21 pm Markus Gebert wrote:
 
 On 13.07.2010, at 16:02, Markus Gebert wrote:
 
  Unfortunately, I have not been able to get anything useful out the svn 
commit logs, which could explain this. Maybe someone else has an idea what 
could have changed between 7 and 8 to break it, and again between 8 and 
CURRENT to magically fix it again.
 
 I tracked this down further. I couldn't easily downgrade my 8.1 installation 
to see when the problem was introduced because the zpool version used is 14. 
So I tried to figure out, when the problem was solved in CURRENT.
 
 I started with the first possible revision that can boot off my v14 pool 
(r201143, Dec 28, zfs v14 commit). With this revision, I was able to trigger 
the MCE.
 
 Then I took some later revision (rev206010, Apr 1, chosen randomly), and I 
couldn't reproduce the problem. I started narrowing the revisions down until I 
found out, that while on r202386 I'm still able to trigger the MCE, r202387 
seems to solve the problem on CURRENT:
 
 http://svn.freebsd.org/viewvc/base?view=revision&revision=202387

Although this change was MFC'd, it was later disabled by default because it 
causes issues on other machines.  I think there is a tunable you need to set 
in loader.conf to enable it for 8.1.  Attilio (the author of that commit) 
should know which tunable to set.

-- 
John Baldwin


Re: 8.1-RC2 MCE caused by some LAPIC/clock changes? (was: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?)

2010-07-20 Thread Markus Gebert

On 20.07.2010, at 10:15, jhell wrote:

 Any ideas how to proceed?
 
 
 Adding to this I remembered some specific commits that caught my attention 
 when they happened. Specifically they were to mca.c (locate mca) on my 
 machine provided the file paths and svn log provided the commit log.
 
 When you said April and I seen the log it rang a bell.

Thank you for the hint. We've already tried to reproduce with MCA disabled, and 
didn't succeed. The thing is, without altering the bios default settings, the 
OS doesn't even get an MCE before the system reboots itself showing those 
hypertransport sync flood and pci express fatal error stuff during POST. So 
I guess it's safe to say, that the problem happens before MCA can kick in.


Markus


Re: 8.1-RC2 MCE caused by some LAPIC/clock changes? (was: 8.1-RC2 - PCI fatal error or MCE triggered by USB/ehci on Sun X4100M2?)

2010-07-20 Thread Markus Gebert

On 20.07.2010, at 21:59, John Baldwin wrote:

 I started narrowing the revisions down until I 
 found out, that while on r202386 I'm still able to trigger the MCE, r202387 
 seems to solve the problem on CURRENT:
 
 http://svn.freebsd.org/viewvc/base?view=revision&revision=202387
 
 Although this change was MFC'd, it was later disabled by default because it 
 causes issues on other machines.  I think there is a tunable you need to set 
 in loader.conf to enable it for 8.1.  Attilio (the author of that commit) 
 should know which tunable to set.

Might be this one in sys/amd64/amd64/clock.c:


static int lapic_allclocks = 1;
TUNABLE_INT("machdep.lapic_allclocks", &lapic_allclocks);


The r202387 changes put this into local_apic.c, guess it was moved later on (or 
after MFC), and that's why I couldn't find it on 8-stable. And, indeed, this 
tunable seems to be gone again in current. Testing with 
machdep.lapic_allclocks=0 right now. So far it looks very promising. I'll let 
it run overnight.

Another thing though: Today I compared verbose boot output from 8-stable and 
the current box. I saw that the ioapic sets up IRQ routing differently on these 
two systems although the hardware is the same. This seemed not so interesting 
at first, but then I noticed that 8-stable sets up two routes (to lapic0 and 
lapic2, or sometimes lapic3) for IRQ58 (mpt0), while current only uses one 
route (to lapic0).

I used 'cpuset -c -l 0 -x 58' in an attempt to make my 8-stable box behave like 
the one running current. Indeed, this seems to have changed IRQ58 to be routed 
to lapic0 only. And the box was running for hours without showing the symptoms.

I just checked the verbose boot output of my 8-stable box again (booted with 
machdep.lapic_allclocks=0 as mentioned above). And now it seems to have set up 
IRQ routes just like the current box (one route for IRQ58 to lapic0).

So I don't get which issue came first... If either one is ruled out, the 
problem seems to be gone. Was it the clock issue causing wrong IRQ routing 
setup which in turn causes mpt or the CPU to go nuts? Or is mpt having two 
interrupt routes actually a normal thing (then why doesn't current behave this 
way?), but the mpt driver causes strange things when operating with clock 
issues? Or have I misinterpreted something?

Here's the boot verbose output of ioapic related to interrupts 56 (em0), 57 
(em1) and 58 (mpt0):

 1st X4100M2 - running 8-stable (machdep.lapic_allclocks=1, MCEs can be 
reproduced easily) 
# egrep '^ioapic' boot.normal | egrep 'IRQ 5[678]' | sort
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 1 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 2 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 3 vector 50


 1st X4100M2 - running 8-stable (machdep.lapic_allclocks=0, test currently 
running, no MCEs so far) 
# egrep '^ioapic' boot.lapic_allclocks0 | egrep 'IRQ 5[678]' | sort
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 2 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 3 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57


 2nd X4100M2 - running current (MCEs cannot be reproduced) 
# dmesg | egrep '^ioapic' | egrep 'IRQ 5[678]' | sort
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 2 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 3 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
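
The comparison above can be automated for longer logs. The Python sketch below groups "routing intpin" lines by IRQ and lists the IRQs that were routed to more than one local APIC. The function names and the regular expression are illustrative and only cover the line format shown in this message; the sample is the lapic_allclocks=1 output quoted above.

```python
import re
from collections import defaultdict

# Matches lines such as:
# ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 3 vector 50
ROUTE_RE = re.compile(
    r"^ioapic\d+: routing intpin \d+ \(PCI IRQ (\d+)\) to lapic (\d+) vector \d+"
)


def routes_by_irq(lines):
    """Map each IRQ number to the list of lapic ids it was routed to."""
    routes = defaultdict(list)
    for line in lines:
        match = ROUTE_RE.match(line.strip())
        if match:
            routes[int(match.group(1))].append(int(match.group(2)))
    return dict(routes)


def irqs_with_multiple_routes(lines):
    """Return the sorted IRQ numbers that appear with more than one route."""
    return sorted(irq for irq, lapics in routes_by_irq(lines).items()
                  if len(lapics) > 1)


SAMPLE = """\
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 0 vector 55
ioapic2: routing intpin 0 (PCI IRQ 56) to lapic 1 vector 50
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 0 vector 56
ioapic2: routing intpin 1 (PCI IRQ 57) to lapic 2 vector 50
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 0 vector 57
ioapic2: routing intpin 2 (PCI IRQ 58) to lapic 3 vector 50
"""

if __name__ == "__main__":
    print(routes_by_irq(SAMPLE.splitlines()))
    print(irqs_with_multiple_routes(SAMPLE.splitlines()))
```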



Markus



Re: Problems replacing failing drive in ZFS pool

2010-07-20 Thread alan bryan


--- On Mon, 7/19/10, Dan Langille d...@langille.org wrote:

 From: Dan Langille d...@langille.org
 Subject: Re: Problems replacing failing drive in ZFS pool
 To: Freddie Cash fjwc...@gmail.com
 Cc: freebsd-stable freebsd-stable@freebsd.org
 Date: Monday, July 19, 2010, 7:07 PM
 On 7/19/2010 12:15 PM, Freddie Cash wrote:
  On Mon, Jul 19, 2010 at 8:56 AM, Garrett Moore garrettmo...@gmail.com wrote:
   So you think it's because when I switch from the old disk to the new disk,
   ZFS doesn't realize the disk has changed, and thinks the data is just
   corrupt now? Even if that happens, shouldn't the pool still be available,
   since it's RAIDZ1 and only one disk has gone away?
  
  I think it's because you pull the old drive, boot with the new drive,
  the controller re-numbers all the devices (ie da3 is now da2, da2 is
  now da1, da1 is now da0, da0 is now da6, etc), and ZFS thinks that all
  the drives have changed, thus corrupting the pool.  I've had this
  happen on our storage servers a couple of times before I started using
  glabel(8) on all our drives (dead drive on RAID controller, remove
  drive, reboot for whatever reason, all device nodes are renumbered,
  everything goes kablooey).
 
 Can you explain a bit about how you use glabel(8) in conjunction with
 ZFS?  If I can retrofit this into an existing ZFS array to make things
 easier in the future...
 
 8.0-STABLE #0: Fri Mar  5 00:46:11 EST 2010
 
 ]# zpool status
   pool: storage
  state: ONLINE
  scrub: none requested
 config:
 
         NAME        STATE     READ WRITE CKSUM
         storage     ONLINE       0     0     0
           raidz1    ONLINE       0     0     0
             ad8     ONLINE       0     0     0
             ad10    ONLINE       0     0     0
             ad12    ONLINE       0     0     0
             ad14    ONLINE       0     0     0
             ad16    ONLINE       0     0     0
 
  Of course, always have good backups.  ;)
 
 In my case, this ZFS array is the backup.  ;)
 
 But I'm setting up a tape library, real soon now
 
 -- Dan Langille - http://langille.org/
 

Dan,

Here's how to do it after the fact:

http://unix.derkeiler.com/Mailing-Lists/FreeBSD/current/2009-07/msg00623.html

--Alan Bryan





