Re: vmrelocate and quiescence time

2020-08-15 Thread Alan Altmark
On Saturday, 08/15/2020 at 02:25 GMT, Grzegorz Powiedziuk 
 wrote:
> Thank you for verification! If timestamps are correct then this step
> literally takes a very brief moment. So I suspect that the final memory
> pass is what takes so much time, or am I wrong?
> So how long does it usually take to vmrelocate a 100-200G VM?  Just an
> estimate ... do I have to worry about my 30 seconds?

As you've noted, "it depends".  A 30-second relocation doesn't bother me, 
but a 30-second quiesce time does.
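If the quiesce cap itself is in play, it is set per relocation on the command. A sketch from memory (the guest and destination member names are illustrative; check the CP VMRELOCATE syntax on your level):

```
VMRELOCATE TEST LINUX01 TO MEMBER2
VMRELOCATE MOVE LINUX01 TO MEMBER2 MAXQUIESCE 30 MAXTOTAL 300
```

TEST reports whether the guest is eligible to move without actually moving it.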

For Very Large Guests, it's typically better to use application clusters 
so that you don't need to move the guests.  Just take one down and let the 
other members take the load.  E.g., I wouldn't use LGR on an Oracle db, 
given its failover capabilities.  Let the standby Oracle instance take 
over.

Alan Altmark

Senior Managing z/VM and Linux Consultant
IBM Systems Lab Services
IBM Z Delivery Practice
ibm.com/systems/services/labservices
office: 607.429.3323
mobile: 607.321.7556
alan_altm...@us.ibm.com
IBM Endicott


--
For LINUX-390 subscribe / signoff / archive access instructions,
send email to lists...@vm.marist.edu with the message: INFO LINUX-390 or visit
http://www2.marist.edu/htbin/wlvindex?LINUX-390


Re: vmrelocate and quiescence time

2020-08-15 Thread Alan Altmark
On Saturday, 08/15/2020 at 02:25 GMT, Grzegorz Powiedziuk 
 wrote:
> On Sat, Aug 15, 2020 at 5:00 AM Alan Altmark 
> wrote:
>
> >
> > Are you using the IMMEDIATE option on VMRELOCATE?  I ask because the
> > default MAXQUIESCE on the VMRELOCATE without IMMEDIATE is 10 seconds. With
> > IMMEDIATE.
>
> Forgot to mention - I have to specify a longer MAXQUIESCE because the
> default 10s was too short. And no, I am not doing IMMEDIATE.

We chose 10 seconds as the default because longer than that tends to cause 
applications to get upset, as you have discovered.  You said you were 
using virtual CTCs, so that means you're in the same LPAR, not just same 
CPC.

Keep in mind that you have two 2nd level systems vying for CPU, and all of 
that I/O is simulated by the 1st level system.  Not a great performer for 
LGR.

Consider letting your 2nd level systems use real CTCs.  I wouldn't use 2nd 
level LGR to predict the performance of 1st level LGR.
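For a rough sense of scale (purely illustrative numbers, not measurements): the floor on total relocation time is guest size over effective link throughput, and the quiesce pass then only re-sends what was dirtied during the earlier passes.

```shell
# Back-of-envelope only. Both numbers are assumptions for illustration:
# 128 GiB is the guest size from this thread; 1 GiB/s stands in for the
# effective ISFC/CTC link throughput, which varies widely by setup.
GUEST_GIB=128
LINK_GIBPS=1
FULL_PASS_S=$((GUEST_GIB / LINK_GIBPS))
echo "first full memory pass: ~${FULL_PASS_S}s at ${LINK_GIBPS} GiB/s"
```

Later passes shrink toward the dirty working set, which is why an idle guest quiescing for 20-30 seconds is surprising.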

Alan Altmark





ancient history - MP3000 CTC network on SLES9

2020-08-15 Thread r.stricklin
Esteemed colleagues -

I am attempting to bring up SLES9 in an LPAR on an MP3000. I have some 
reasonable experience with SLES9 on s390 from many years ago, but my experience 
then was with VM, OSA Express, and VSWITCH. So some of this is new to me, and 
I'm struggling.

I have the install working (to some degree) with the MP3000 emulated 
LCS3174/MPTN arrangement, but attempts to bring the (emulated) 3390 volumes 
online hang. The first `echo 1 >/sys/bus/ccw/devices/0.0.0980/online` blocks 
and never returns, even though there are messages from the kernel indicating 
the device has been probed (it lists geometry, etc.). Subsequent attempts 
return immediately, but the value of `online` always stays 0.

I noticed the kernel task 'regipm' (associated with the LCS device) 
constantly chewing all of one CPU, apparently due to a multicast bug that 
this SLES9 kernel is just a couple of patches short of a fix for. I would 
like to eliminate it as a potential cause of the dasd problems (however 
unlikely), so I thought I'd try with a CTC network instead. And this is 
not working at all.
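For completeness, the online step above is just the standard sysfs write. A guarded variant (assumptions: the standard ccw sysfs layout, and a `timeout` binary, which a stock SLES9 install may not have) that won't wedge the shell if the write blocks:

```shell
# Guarded version of the ccw online step. SYSFS is overridable only so the
# snippet degrades gracefully on a machine without ccw devices; on the real
# system it is just /sys. Note: `timeout` comes from modern coreutils and
# may be absent on a SLES9-vintage install.
SYSFS=${SYSFS:-/sys}
dev=0.0.0980
f="$SYSFS/bus/ccw/devices/$dev/online"
if [ -w "$f" ]; then
    timeout 30 sh -c "echo 1 > '$f'" || echo "online write for $dev timed out"
    state=$(cat "$f")
else
    state=absent
fi
echo "$dev online=$state"
```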

I am trying to create a CTC network between SLES9 in partition 3 (IZXL) and 
VM/ESA V2R4 TCPIP in partition 2 (IZXD). VM has an already-functioning ethernet 
IP link via the emulated LCS.

        devs (r,w)  ip               peer
VM:     2331,2330   192.168.176.179  192.168.176.79
Linux:  2320,2321   192.168.176.79   192.168.176.179

SLES9 will configure the CTC but fails to ping its peer, and the install 
can't proceed. I'm not sure what I'm missing; it's likely some EMIF CTC 
detail. My incomplete understanding is that 2320/2330 should be the two 
ends of one CTC connecting LPARs 2 and 3, and that nothing more need be 
done to connect them, but I'm not really sure.


Relevant IOCDS entries for the EMIF CTC definitions -

   CHPID PATH=(10),TYPE=CTC,SHARED,PART=((IZOD,IZXD,IZXL),(=))
   CHPID PATH=(11),TYPE=CNC,SHARED,PART=((IZOD,IZXD,IZXL),(=))
   CNTLUNIT CUNUMBR=232F,PATH=(10),CUADD=2,UNIT=SCTC,           X
         UNITADD=((00,16))
   CNTLUNIT CUNUMBR=233E,PATH=(11),CUADD=3,UNIT=SCTC,           X
         UNITADD=((00,16))
   IODEVICE ADDRESS=(2320,16),UNIT=SCTC,CUNUMBR=(232F),         X
         UNITADD=00,STADET=Y,PART=(IZXL)
   IODEVICE ADDRESS=(2330,16),UNIT=SCTC,CUNUMBR=(233E),         X
         UNITADD=00,STADET=Y,PART=(IZOD,IZXD)

Relevant data from the VM side - 
   q ctc
   CTCA 0E20 ATTACHED TO TCPIP0E20
   CTCA 0E21 ATTACHED TO TCPIP0E21
   CTCA 2330 ATTACHED TO TCPIP2330
   CTCA 2331 ATTACHED TO TCPIP2331
   Ready; T=0.01/0.01 08:57:53

PROFILE TCPIP (excerpted) -
   DEVICE UNIT0 LCS E20
   LINK MPTN2 ETHERNET 2 UNIT0
   DEVICE IZXL CTC 2330
   LINK IZXL2320 CTC 1 IZXL   /* I've tried this with both CTC 1 and CTC 0 */
   HOME
      192.168.5.78     MPTN2
      192.168.176.179  IZXL2320
   GATEWAY
      192.168.0.0  =  MPTN1  DEFAULTSIZE  0.0.255.0  0.0.5.0
   ; IZXL CTC
      192.168.176.79  =  IZXL2320  1500  HOST
   START UNIT0
   START IZXL2320

Linux console log (excerpted) -

   Please select the type of your network device:
   4) Channel to Channel
   Enter your choice (0-10):
4
   Loading CTC module:
   CTC driver Version: 1.58.2.1  initialized
   List of first 10 CTC Channels that were detected:
   Device   Channel type
   0.0.0e20 3088/01
   0.0.0e21 3088/01
   0.0.2310 3088/1f
   0.0.2311 3088/1f
   Device address for read channel (0.0.0e20):
0.0.2320
   Device address for write channel (0.0.0e21):
0.0.2321
   Select protocol number for CTC:
   0) Compatibility mode, also for non-Linux peers other
  than OS/390 and z/OS (this is the default mode)
   1) Extended mode
   3) Compatibility mode with OS/390 and z/OS
   Enter your choice (0):
0
   ctc0: read: ch-0.0.2320, write: ch-0.0.2321, proto: 0
   ctc0 detected.
   ctc0 is available, continuing with network setup.
   ifconfig ctc0 192.168.176.79 pointopoint 192.168.176.179 mtu 1500
   Trying to ping my IP address:
   PING 192.168.176.79 (192.168.176.79) 56(84) bytes of data.
   64 bytes from 192.168.176.79: icmp_seq=1 ttl=64 time=0.171ms
   3 packets transmitted, 3 received, 0% packet loss, time 1998ms
   Waiting 6 seconds for connection with remote side.
   Waiting 3 seconds for connection with remote side.
   Waiting 4 seconds for connection with remote side.
   Waiting 5 seconds for connection with remote side.
   Waiting 4 seconds for connection with remote side.
   Waiting 4 seconds for connection with remote side.
   Waiting 3 seconds for connection with remote side.
   Waiting 2 seconds for connection with remote side.
   

Re: sles15sp2 ipl exception

2020-08-15 Thread Juha Vuori

Follow-up on this:

Upgrading s390-tools to level 2.11.0-9.6.1 fixed the boot problem in our 
upgraded sles15.2 system.
So most probably we were hit by:
    - zipl: check for valid ipl parmblock lowcore pointer (bsc#1174310)
The patch SUSE-SLE-Module-Basesystem-15-SP2-2020-2227 fixes that, and it was released in the SUSE 
public repos a couple of days ago.


I don't know (yet) whether the problem also affects sles15.2 systems built from scratch, but for 
sles15.2 systems upgraded from earlier SP levels this fix could be essential.
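A quick, hedged way to check whether a given system already carries the fixed level (the version string comes from this thread; the `rpm` query flags are standard, but verify the package naming against SUSE's advisory):

```shell
# Compare the installed s390-tools level against the fixed level named in
# this thread (2.11.0-9.6.1), using sort -V for version ordering.
need=2.11.0-9.6.1
if rpm -q s390-tools >/dev/null 2>&1; then
    have=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' s390-tools | tail -n 1)
else
    have=0   # rpm missing or package not installed
fi
newest=$(printf '%s\n%s\n' "$need" "$have" | sort -V | tail -n 1)
if [ "$have" != 0 ] && [ "$newest" = "$have" ]; then
    echo "s390-tools $have includes the zipl fix"
else
    echo "s390-tools needs update (have $have, need >= $need)"
fi
```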


--
Best regards,
Juha Vuori

On 28.7.2020 10.46, Stefan Haberland wrote:

Hi Juha,

thanks for the data. I have tested locally with basically the same setup.

Kernel 5.3.18-22-default
zipl: zSeries Initial Program Loader version 2.11.0-7.27

I have seen the problem once but for me it disappeared after IPL using
clear.

After reviewing the code I think there is a chance that one of the known
bugs may be causing your problem.
It is caused by the IPL code wrongly reading (random) data from low
memory, which causes problems if that memory is not zero.

If you have a SUSE support contract you may open a case with them and
reference SUSE Bugzilla 1174310.
There is a chance that you can get an s390-tools test rpm with the fix
included.

Regards,
Stefan

Am 27.07.20 um 22:35 schrieb Juha Vuori:

Hi Stefan,

IPL by profile.exec was the very same:
'cp ipl 300 clear'
but here's another interactively:

q v dasd
DASD 0190 3390 710RES R/O    214 CYL ON DASD  A640 SUBCHANNEL = 000D
DASD 0191 3390 VMAUS1 R/W    100 CYL ON DASD  A443 SUBCHANNEL = 
DASD 019D 3390 710RES R/O    292 CYL ON DASD  A640 SUBCHANNEL = 000E
DASD 019E 3390 710RES R/O    500 CYL ON DASD  A640 SUBCHANNEL = 000F
DASD 0300 9336 VASC03 R/W    1024000 BLK ON DASD  F003 SUBCHANNEL = 0001
DASD 0301 9336 VASC00 R/W   10485760 BLK ON DASD  F000 SUBCHANNEL = 0002
DASD 0390 9336 VASC02 R/W    2097152 BLK ON DASD  F002 SUBCHANNEL = 0003
DASD 0502 3390 VMAUS1 R/O  5 CYL ON DASD  A443 SUBCHANNEL = 0010
Ready; T=0.01/0.01 23:13:58

CP I 300 CLEAR
Booting default (grub2)
HCPGIR450W CP entered; disabled wait PSW 0002 8000 
A0DC

No other log output.

300 contains /boot in ext4 fs.
Everything looks ok in there to me.
Migration seems to have installed kernel 5.3.18-22.
300 mounted in another linux:

zlnx002:/mnt/sles15gi/boot # ll *5.3*
-rw-r--r-- 1 root root 3038666 Jun  6 21:06 System.map-5.3.18-22-default
-rw-r--r-- 1 root root   96492 Jun  6 20:42 config-5.3.18-22-default
-rw-r--r-- 1 root root 6259256 Jun  6 21:57 image-5.3.18-22-default
-rw------- 1 root root 9892492 Jul 25 19:57 initrd-5.3.18-22-default
-rw-r--r-- 1 root root  218167 Jun  6 21:18 symvers-5.3.18-22-default.gz
-rw-r--r-- 1 root root 377 Jun  6 21:18 sysctl.conf-5.3.18-22-default
-rw-r--r-- 1 root root 7092916 Jun  6 21:33 vmlinux-5.3.18-22-default.gz

Migration was done from local sles15sp2 rmt repositories, which were
newly refreshed (last Saturday, IIRC).

Regards,
Juha

On 27.7.2020 17.56, Stefan Haberland wrote:

Hi Juha,

there have been some fixes in the IPL code recently that will be
available for SLES15.2 with one of the next updates.
But I cannot say definitively that one of them is causing your issue.

Are there any additional messages on the screen?

How did you IPL the system?

Could you please try to IPL with clearing the memory? Just issue

#cp i  clear

where  is your device number and clear sets the contents of your
virtual machine's storage to binary zeros before the operating system is
loaded.


Regards,
Stefan

Am 27.07.20 um 13:17 schrieb Juha Vuori:

Hi,

Before opening an SR, I'd check if this is a known problem:

After migrating a SLES15 SP1 server to SP2, its IPL fails with

HCPGIR450W CP entered; disabled wait PSW 0002 8000 
A0DC

z13s, z/VM 7.1, linux disks: EDEV mdisks




Re: vmrelocate and quiescence time

2020-08-15 Thread Grzegorz Powiedziuk
On Fri, Aug 14, 2020 at 4:45 PM Scott Rohling 
wrote:

> One key question is whether lpars are on same cec or different ones...
> virtual ctcs or "real"?
>
> Scott Rohling
>

They are on the same CEC, with virtual CTCs.



Re: vmrelocate and quiescence time

2020-08-15 Thread Grzegorz Powiedziuk
On Sat, Aug 15, 2020 at 5:00 AM Alan Altmark 
wrote:

>
> Are you using the IMMEDIATE option on VMRELOCATE?  I ask because the
> default MAXQUIESCE on the VMRELOCATE without IMMEDIATE is 10 seconds. With
> IMMEDIATE.

Forgot to mention - I have to specify a longer MAXQUIESCE because the
default 10s was too short. And no, I am not doing IMMEDIATE.
Although I have to specify FORCE STORAGE, because the VM's "MAX" storage
parameter is set much higher than we need right now (in case we need to
add more dynamically), and that "MAX" is higher than the total paging
space on either of the systems. We have plenty of memory available on
both ends.



> Yes, that's how it works.  It forces OSA and FCP connections to be rebuilt
> with the correct parameters.
>

Thank you for verification! If timestamps are correct then this step
literally takes a very brief moment. So I suspect that the final memory
pass is what takes so much time, or am I wrong?
So how long does it usually take to vmrelocate a 100-200G VM?  Just an
estimate ... do I have to worry about my 30 seconds?
Thanks Alan

Gregory



Re: vmrelocate and quiescence time

2020-08-15 Thread Alan Altmark
On Friday, 08/14/2020 at 08:28 GMT, Grzegorz Powiedziuk 
 wrote:
> Hello,
> From your experience, during the relocation from one LPAR to another, how
> long on average is the quiescence period (no network, no nothing)?
> I understand that it depends on many factors, but I am just asking for
> some examples.
> From my understanding the quiescence is one of the last steps, right
> before the final pass of memory, which should be as small as possible. I
> understand that if there is a lot happening in the VM then the last pass
> can also be quite big. But still ...
>
> Anyway, I have a rhel VM with 128GB of memory running DB2. Currently not
> prod, so CPUs are mostly idle. No paging, no swapping, not much traffic,
> and the quiescence takes 20-30 seconds, which seems a lot (the vmrelocate
> takes a couple of minutes in total). It is causing db2 and ssh sessions to
> time out.
> I am wondering if that is normal or we have something misconfigured.

Are you using the IMMEDIATE option on VMRELOCATE?  I ask because the 
default MAXQUIESCE on the VMRELOCATE without IMMEDIATE is 10 seconds. With 
IMMEDIATE.

> During the last phase the linux kernel sends error messages "QDIO problem
> occurred" for each FCP, and some QETH errors, and they are all followed
> with recoveries. All these are being thrown at the same time right after
> quiescence ends (I think). Which kind of makes sense to me.

Yes, that's how it works.  It forces OSA and FCP connections to be rebuilt 
with the correct parameters.

Alan Altmark


