Fw: Corrupted Kernel?

Ed O'Rourke Sat, 26 Jul 2003 20:26:01 -0700

----- Original Message ----- 
From: Ed O'Rourke 
To: linux forum 
Sent: Saturday, July 26, 2003 11:11 PM
Subject: Fw: Corrupted Kernel?

----- Original Message ----- 
From: "Lucius, Leland" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Saturday, July 26, 2003 11:30 AM
Subject: Re: Corrupted Kernel?

Cool.  Got what was expected.  Did the "TRACE ST INTO 10480.32" hit at any
other point in the process? 
NO

It sounds like it's always the 4th IPL that dies???  That's REALLY weird.

Some questions for you:

1)  After the failure, do you have to restore the volumes from tape again?
    Or do you just do a manual IPL to get things cleared up?

It's on 3 full 3390-3 packs: 62c is /, 62d is /usr, 62e is /opt. So I just DDR disk to 
disk the original 62c from another 3390-3 where the original system was kept.
I do not have to restore 62d or 62e, which are /usr and /opt; they are fine. When this 
happens it will not IPL anymore; you get the failure until I restore the one pack. 
These are not minidisks; they are dedicated 3390-3 full packs attached to the Linux guest.
I moved the system to different DASD devices to rule out a possible disk problem, with 
the same results. It's running on a 2105 (Shark), RAID 5. I have another SLES8 test system 
that I copied today and IPLed 25 times with no failures (same kernel: Linux version 
2.4.19-3suse-SMP ([EMAIL PROTECTED]) (gcc version 3.2) #1 SMP Wed Nov 6 22:34:43 UTC 
2002). Monday I will reload the ISO images I have, then apply the Service Pack 2 CD, 
which brings the kernel up to 2.4.19-4suse-SMP. I'm just baffled, because it works 
fine if you leave it up; it will run for days. Shut down and re-IPL a few times and 
goodbye! Is there any way to fix this without a full restore? I can mount the disk on 
another Linux system.

2)  Have you tried doing "shutdown -h now" commands followed by manual IPLs?
    Does the problem happen then?

YES, it also fails with shutdown -h and manual IPLs.

> Failed restart. 3 worked; this is the 4th.
> 
> Tracing active at IPL                                         
>                                                                   
> 00:  -> 000020D0  MVC   D2FF20001000 >> 00010480    00001000  
>   CC 0                                                                
> 00: B                                                         

I wish I'd asked you to display 10480 and 1000 here.  The 1000 location is
where the bootloader reads the command line into before moving it over to
its final resting place.  Man, if it's screwed up at this point then
something really strange is afoot.  There's not really any "fancy" stuff
going on up to this point and, since it worked 3 times before, it should
have worked the 4th time too.
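
For anyone following the trace, the MVC shown at 000020D0 can be decoded by
hand.  Here is a minimal Python sketch, assuming the standard SS instruction
layout (opcode byte, length byte, then two base/displacement pairs).  The
function name decode_ss is made up for illustration and is not part of any
tool mentioned in this thread.

```python
def decode_ss(hex_text):
    """Split a 6-byte SS-format instruction into its fields."""
    raw = bytes.fromhex(hex_text)
    if len(raw) != 6:
        raise ValueError("SS-format instructions are 6 bytes long")
    first = int.from_bytes(raw[2:4], "big")    # B1 (4 bits) + D1 (12 bits)
    second = int.from_bytes(raw[4:6], "big")   # B2 (4 bits) + D2 (12 bits)
    return {
        "opcode": raw[0],
        "length": raw[1] + 1,   # the L field holds the byte count minus one
        "b1": first >> 12, "d1": first & 0xFFF,
        "b2": second >> 12, "d2": second & 0xFFF,
    }

# Instruction bytes taken from the trace line in this message.
fields = decode_ss("D2FF20001000")
print(fields)
```

Read that way, the instruction copies 256 bytes from 0(R1) to 0(R2), and the
trace line shows those two operands resolving to 00001000 and 00010480.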

>                                                                       
> 00:  -> 00010000  BASR  0DD0        CC 0                      
>                                                                       
> 00: D TX10480.16                                              
>                                                                       
> 00: R00010480  5B47656E 6572616C 5D0A556E 69717565 06 
> *[General].Unique*                                            
>                 
> 00: R00010490  49443D31 6851342E 48624B43 68665276    
> *ID=1hQ4.HbKChfRv*                                            
>                 
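
As a cross-check on the display above, the hex words can be decoded as ASCII
to confirm what actually landed at 10480.  A minimal Python sketch; the helper
name decode_dump_words is illustrative only, and the hex words are copied from
the two display lines.

```python
def decode_dump_words(words):
    """Turn space-separated hex words from a storage display into ASCII text."""
    raw = bytes.fromhex("".join(words.split()))
    # Bytes outside the ASCII range become U+FFFD instead of raising.
    return raw.decode("ascii", errors="replace")

line1 = decode_dump_words("5B47656E 6572616C 5D0A556E 69717565")
line2 = decode_dump_words("49443D31 6851342E 48624B43 68665276")
print(repr(line1 + line2))
```

The first 32 bytes decode to "[General]", a newline, and then the start of a
"UniqueID=..." line, so the area holds something that looks like a
configuration file rather than a kernel command line.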

So we now know it's happening fairly early on.  Since I don't have a solid
answer, let's do some brainstorming...

"Minidisk cache hosed or CCW interpretation screwed up"
VERY unlikely.  Probably should say "Not possible" since it's so unlikely.

"Corrupt disk"
If the disk were corrupted immediately after loading from tape, then the
first IPL would have failed.  If it is getting corrupted after the IPL, then
it should happen after the first IPL, and how likely is it that the same data
is involved each time (possible, but odd)?
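
One way to test the corrupt-disk idea is to compare the failing 62c against
the known-good copy and see which blocks differ.  The sketch below is only an
illustration: it assumes both copies are readable as ordinary files or block
devices from another Linux system, the 4096-byte block size is an arbitrary
choice, and differing_blocks is a made-up helper.  The demo at the end uses
two small throwaway files in place of real disk images.

```python
import os
import tempfile

def differing_blocks(path_a, path_b, block_size=4096):
    """Return the indexes of fixed-size blocks whose contents differ."""
    diffs = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:
                break               # both inputs exhausted
            if a != b:
                diffs.append(index)
            index += 1
    return diffs

# Demo: three zero-filled blocks, with one byte flipped in the second block.
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.img")
    bad = os.path.join(tmp, "bad.img")
    data = bytes(4096 * 3)
    with open(good, "wb") as f:
        f.write(data)
    with open(bad, "wb") as f:
        f.write(data[:4096] + b"\x01" + data[4097:])
    demo_result = differing_blocks(good, bad)
print(demo_result)
```

If the same few blocks turn up after every failure, that would point at
something rewriting a specific area of the pack rather than random damage.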

However, if you DO have to reload the disks after the failure, then I would
suspect

"ReIPL code not resetting something"
I reckon that if the ReIPL code didn't reset some control register we could
be grabbing a stale page from somewhere else in storage, but that'd be
unlikely since a couple of reipls worked.

"Some sort of I/O failure"
I can't think of what it would be though.

"Bogus pointer usage"
If the block read from disk and moved to 10480 is okay, then something
between that point and the 0x10000 break could be overlaying 10480.  But
the "TR STO..." command should have trapped that.  This is all ZIPL code up
to the 0x10000 point.

"Old kernel"
Actually, this would be one of the first things you could try.  I noticed
that you last compiled your kernel back in November of 2002.  I just went
back through the IBM patch descriptions and, while I don't fully understand
the implications of most of them, I can imagine that one of the bugs could
be your problem.  And this doesn't include whatever SuSE would have included
in later kernel levels.

Anybody else have any ideas?

Leland
