Hello all,

Has anyone experienced issues with Red Hat EL 5.6 using kernels 2.6.18-238, 
2.6.18-238.1.1 or 2.6.18-238.5.1 booting in an ESX 3.5 virtual environment?  
We are running into a condition where VMs hang during the initial kernel 
boot process.  I'm unable to correlate these hangs with any particular 
ESX-level event; the VMs are running on different ESX hosts and even 
different clusters.  All of the issues began with the upgrade to EL 5.6 and 
kernel 2.6.18-238.1.1.el5 and persist in 2.6.18-238.5.1.el5 (we skipped 
-238.el5).  This has affected more than 20 hosts so far, of all different 
configurations, but always EL 5.6 VMs only.  AS4 is not affected and we don't 
have any EL6 VMs yet.  The symptom is exactly the same every time.  During 
the initial kernel start, the VM gets as far as:

  PCI: Setting latency timer of device 0000:00:01.0 to 64
  NET: Registered protocol family 2
  IP route cache hash table entries: 32768 (order: 5, 131072 bytes)
  TCP established hash table entries: 131072 (order: 8, 1048576 bytes)
  TCP bind hash table entries: 65536 (order: 7, 524288 bytes)
  TCP: Hash tables configured (established 131072 bind 65536)
  TCP reno registered
  Simple Boot Flag at 0x36 set to 0x80

The next line on all VMs that boot successfully is:

  Using TSC for driving interrupts

However, VMs that hang during boot never reach the "Using TSC..." line.  
This leads me to believe that the problem is related to the OS electing to use 
TSC as the clocksource, and that this is somehow an unstable combination with 
ESX 3.5 and EL 5.6 VMs.  The issue is sporadic, though, and I can't reproduce 
it on demand; all I can say is that when an EL 5.6 VM fails to boot, it always 
fails in the same place in the same way.  I've considered moving back to the 
kernel flags clocksource=acpi_pm divider=10 that were recommended for EL 5.3 
and earlier, but I'm hesitant to do that since TSC is clearly a 
better-performing timekeeper.
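
For anyone who wants to check their own VMs for the same pattern, here is a 
minimal sketch in Python that sorts captured boot logs by whether they reached 
the "Using TSC..." line.  It assumes the console output of each boot has been 
saved to a text file (for example from a serial console), which is not 
something described above, and that it is run with Python 3 on a workstation 
rather than on the EL 5.6 guest.  The function name classify_boot and the 
three result labels are made up for illustration.

```python
import sys

# Marker strings taken from the console output quoted above.
LAST_SEEN_BEFORE_HANG = "Simple Boot Flag at"
NEXT_ON_GOOD_BOOT = "Using TSC for driving interrupts"


def classify_boot(log_text):
    """Classify one captured boot log (hypothetical helper).

    Returns "ok" if the TSC line was reached, "hung-before-tsc" if the
    log contains the Simple Boot Flag line but no TSC line, and
    "unknown" if neither marker is present.
    """
    lines = log_text.splitlines()
    reached_flag = any(LAST_SEEN_BEFORE_HANG in line for line in lines)
    reached_tsc = any(NEXT_ON_GOOD_BOOT in line for line in lines)
    if reached_tsc:
        return "ok"
    if reached_flag:
        return "hung-before-tsc"
    return "unknown"


if __name__ == "__main__":
    # Usage (assumed file names): python classify_boot.py vm01.log vm02.log
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            print(path, classify_boot(handle.read()))
```

A log that was still being written when it was captured will also show up as 
"hung-before-tsc", so the result is only a first filter and not proof of a 
hang.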

On physical hosts, even ones that use TSC, I never see a "Using TSC for driving 
interrupts" kernel message, so the behavior is subtly different, but I can't 
find anything on Google about this kernel message or event.

Has anyone encountered this?  Is anyone able to shed light on the inner 
workings of TSC that might lead me to a solution (or at least help me file an 
intelligent Bugzilla report)?

Thanks.

--
Jason McCormick
Unix Team Lead, Systems Group, IT
Software Engineering Institute, Carnegie Mellon Univ.
E: [email protected]



_______________________________________________
rhelv5-list mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/rhelv5-list
