Cool.  Got what was expected.  Did the "TRACE ST INTO 10480.32" hit at any
other point in the process?  It sounds like it's always the 4th IPL that
dies???  That's REALLY weird.

Some questions for you:

1)  After the failure, do you have to restore the volumes from tape again?
    Or do you just do a manual IPL to get things cleared up?
2)  Have you tried issuing "shutdown -h now" followed by a manual IPL each
    time?  Does the problem happen then?

> Failed restart . 3 worked this is the 4th.
> 
> Tracing active at IPL                                         
>                                                                   
> 00:  -> 000020D0  MVC   D2FF20001000 >> 00010480    00001000  
>   CC 0                                                                
> 00: B                                                         

I wish I'd asked you to display 10480 and 1000 here.  Location 1000 is
where the bootloader reads the command line before moving it over to
its final resting place.  Man, if it's screwed up at this point then
something really strange is afoot.  There's not really any "fancy" stuff
going on up to this point and, since it worked 3 times before, it should
have worked the 4th time too.
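
For anyone who wants to pick the traced instruction apart by hand, here is
a minimal Python sketch.  It assumes the standard SS-format layout for MVC
(opcode D2, length-minus-one, then two base/displacement pairs), and
decode_mvc is just a name I made up for illustration:

```python
def decode_mvc(hex_text):
    """Split a 6-byte MVC instruction into its SS-format fields."""
    raw = bytes.fromhex(hex_text)
    if len(raw) != 6 or raw[0] != 0xD2:
        raise ValueError("not a 6-byte MVC (opcode D2) instruction")
    # Byte 1 holds the length minus one.
    length = raw[1] + 1
    # Each operand is a 4-bit base register plus a 12-bit displacement.
    b1, d1 = raw[2] >> 4, ((raw[2] & 0x0F) << 8) | raw[3]
    b2, d2 = raw[4] >> 4, ((raw[4] & 0x0F) << 8) | raw[5]
    return {"length": length, "b1": b1, "d1": d1, "b2": b2, "d2": d2}


# Instruction bytes taken from the trace line at 000020D0.
info = decode_mvc("D2FF20001000")
print(info)
```

So the MVC copies 256 bytes from 0(R1) to 0(R2), which lines up with the
two operand addresses the trace shows (00010480 as the target and 00001000
as the source).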

>                                                                       
> 00:  -> 00010000  BASR  0DD0        CC 0                      
>                                                                       
> 00: D TX10480.16                                              
>                                                                       
> 00: R00010480  5B47656E 6572616C 5D0A556E 69717565 06 
> *[General].Unique*
>                 
> 00: R00010490  49443D31 6851342E 48624B43 68665276    
> *ID=1hQ4.HbKChfRv*                                            
>                 
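
As a sanity check on the display output, the hex words can be translated
back to text by hand.  A small Python sketch, assuming the storage holds
plain ASCII (the names dump and decode_line are mine, not anything from
CP or ZIPL):

```python
# Hex words copied from the two displayed lines, keyed by address.
dump = {
    0x10480: "5B47656E 6572616C 5D0A556E 69717565",
    0x10490: "49443D31 6851342E 48624B43 68665276",
}


def decode_line(words):
    """Translate hex words to ASCII, showing non-printables as '.'."""
    raw = bytes.fromhex(words.replace(" ", ""))
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


text = {addr: decode_line(words) for addr, words in dump.items()}
for addr, line in text.items():
    print(f"{addr:08X}  *{line}*")
```

The first line starts with 5B and has 5D after "General", which are the
ASCII square brackets, followed by 0A (a newline).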

So we now know it's happening fairly early on.  Since I don't have a solid
answer, let's do some brainstorming...

"Minidisk cache hosed or CCW interpretation screwed up"
VERY unlikely.  Probably should say "Not possible" since it's so unlikely.

"Corrupt disk"
If the disk were corrupted immediately after loading from tape, then the
first IPL would have failed.  If it is getting corrupted after an IPL, then
the failure should show up right after the first IPL.  And how likely is it
that the same data gets hit every time (possible, but odd)?

However, if you DO have to reload the disks after the failure, then I would
suspect disk corruption after all.

"ReIPL code not resetting something"
I reckon that if the ReIPL code didn't reset some control register we could
be grabbing a stale page from somewhere else in storage, but that'd be
unlikely since a couple of ReIPLs worked.

"Some sort of I/O failure"
I can't think of what it would be though.

"Bogus pointer usage"
If the block read from disk and moved to 10480 is okay, then something
between that point and the 0x10000 break could be overlaying 10480.  But
the "TR STO..." command should have trapped that.  This is all ZIPL code up
to the 0x10000 point.

"Old kernel"
Actually, this would be one of the first things you could try.  I noticed
that you last compiled your kernel back in November of 2002.  I just went
back through the IBM patch descriptions and, while I don't fully understand
the implications of most of them, I can imagine that one of the bugs could
be your problem.  And this doesn't include whatever SuSE would have included
in later kernel levels.

Anybody else have any ideas?

Leland
