Re: [Ocfs2-users] OCFS2 cluster won't come up and stay up

2011-12-02 Tony Rios
Bug #1338 has been created.

http://oss.oracle.com/bugzilla/show_bug.cgi?id=1338


On Dec 1, 2011, at 6:36 PM, Sunil Mushran wrote:

 To analyze this, one needs the logs. And a bugzilla is a good placeholder for the 
 logs.
 
 On Dec 1, 2011, at 6:05 PM, Tony Rios t...@tonyrios.com wrote:
 
 Sunil,
 Is submitting a bug report the only answer?
 I'm happy to send in this information, but can I take the cluster down 
 entirely and sort of reset it so we can get these servers back online and 
 talking again in the meanwhile?
 Tony
 
 On Dec 1, 2011, at 5:05 PM, Sunil Mushran wrote:
 
 Node 3 is joining the domain. It is having problems getting the superblock 
 cluster lock.
 Create a bugzilla on oss.oracle.com and attach /var/log/messages from 
 all nodes.
 If you have netconsole set up, attach those logs too.
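 Sunil refers to the joining node by its node number, presumably the number value from /etc/ocfs2/cluster.conf, where each node: stanza lists a number and a name. To find out which server "node 3" is, that file can be read by hand, or with a small parser like the one below. This is a minimal Python sketch, not part of ocfs2-tools, and it assumes the usual cluster.conf layout: a stanza header such as node: or cluster: on its own line, followed by indented key = value lines.

```python
import sys
from typing import Dict


def node_names_by_number(conf_text: str) -> Dict[int, str]:
    """Map node numbers to node names using the text of an O2CB cluster.conf.

    Assumes the usual layout: a stanza header such as 'node:' or 'cluster:'
    on its own line, followed by indented 'key = value' lines.
    """
    nodes: Dict[int, str] = {}
    stanza = ""                  # name of the stanza currently being read
    fields: Dict[str, str] = {}  # key = value pairs seen in that stanza

    def finish_stanza() -> None:
        # Only node stanzas carry a node number; the cluster stanza also has a 'name'.
        if stanza == "node" and "number" in fields and "name" in fields:
            nodes[int(fields["number"])] = fields["name"]

    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith(":") and "=" not in line:  # a new stanza header
            finish_stanza()
            stanza, fields = line[:-1].strip(), {}
        elif "=" in line:
            key, _, value = line.partition("=")
            fields[key.strip()] = value.strip()
    finish_stanza()
    return nodes


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python node_lookup.py /etc/ocfs2/cluster.conf 3
    with open(sys.argv[1]) as f:
        names = node_names_by_number(f.read())
    number = int(sys.argv[2])
    print(names.get(number, f"node {number} is not in {sys.argv[1]}"))
```

 Every node in an O2CB cluster should carry an identical cluster.conf, so the lookup can be run on any of them.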
 
 On 12/01/2011 04:55 PM, Tony Rios wrote:
 I'm having an issue today where I just can't seem to keep all the servers 
 in the cluster online.
 They aren't losing network connectivity, I can ping the iSCSI host just 
 fine, and the host is logged in.
 
 These are the errors from dmesg when I try to mount the filesystem:
 
 root@pedge36:~# dmesg
 [0.00] Initializing cgroup subsys cpuset
 [0.00] Initializing cgroup subsys cpu
 [0.00] Linux version 2.6.38-10-generic (buildd@yellow) (gcc version 4.5.2 (Ubuntu/Linaro 4.5.2-8ubuntu4) ) #46-Ubuntu SMP Tue Jun 28 15:07:17 UTC 2011 (Ubuntu 2.6.38-10.46-generic 2.6.38.7)
 [0.00] Command line: BOOT_IMAGE=/boot/vmlinuz-2.6.38-10-generic root=UUID=3cd859b8-2605-4a38-8767-a6d1f99d53bd ro debug ignore_loglevel
 [0.00] BIOS-provided physical RAM map:
 [0.00]  BIOS-e820:  - 000a (usable)
 [0.00]  BIOS-e820: 0010 - effc (usable)
 [0.00]  BIOS-e820: effc - effcfc00 (ACPI data)
 [0.00]  BIOS-e820: effcfc00 - e000 (reserved)
 [0.00]  BIOS-e820: f000 - f400 (reserved)
 [0.00]  BIOS-e820: fec0 - fed00400 (reserved)
 [0.00]  BIOS-e820: fed13000 - feda (reserved)
 [0.00]  BIOS-e820: fee0 - fee1 (reserved)
 [0.00]  BIOS-e820: ffb0 - 0001 (reserved)
 [0.00]  BIOS-e820: 0001 - 0001e000 (usable)
 [0.00]  BIOS-e820: 0001e000 - 0002 (reserved)
 [0.00]  BIOS-e820: 0002 - 00021000 (usable)
 [0.00] debug: ignoring loglevel setting.
 [0.00] NX (Execute Disable) protection: active
 [0.00] DMI 2.3 present.
 [0.00] DMI: Dell Computer Corporation PowerEdge 850/0Y8628, BIOS A04 08/22/2006
 [0.00] e820 update range:  - 0001 (usable) ==> (reserved)
 [0.00] e820 remove range: 000a - 0010 (usable)
 [0.00] No AGP bridge found
 [0.00] last_pfn = 0x21 max_arch_pfn = 0x4
 [0.00] MTRR default type: uncachable
 [0.00] MTRR fixed ranges enabled:
 [0.00]   0-9 write-back
 [0.00]   A-B uncachable
 [0.00]   C-CBFFF write-protect
 [0.00]   CC000-EBFFF uncachable
 [0.00]   EC000-F write-protect
 [0.00] MTRR variable ranges enabled:
 [0.00]   0 base 0 mask E write-back
 [0.00]   1 base 2 mask FF000 write-back
 [0.00]   2 base 0F000 mask FF000 uncachable
 [0.00]   3 disabled
 [0.00]   4 disabled
 [0.00]   5 disabled
 [0.00]   6 disabled
 [0.00]   7 disabled
 [0.00] x86 PAT enabled: cpu 0, old 0x7040600070406, new 0x7010600070106
 [0.00] e820 update range: f000 - 0001 (usable) ==> (reserved)
 [0.00] last_pfn = 0xeffc0 max_arch_pfn = 0x4
 [0.00] found SMP MP-table at [880fe710] fe710
 [0.00] initial memory mapped : 0 - 2000
 [0.00] init_memory_mapping: -effc
 [0.00]  00 - 00efe0 page 2M
 [0.00]  00efe0 - 00effc page 4k
 [0.00] kernel direct mapping tables up to effc @ 1fffa000-2000
 [0.00] init_memory_mapping: 0001-00021000
 [0.00]  01 - 021000 page 2M
 [0.00] kernel direct mapping tables up to 21000 @ effb6000-effc
 [0.00] RAMDISK: 366d - 3736
 [0.00] ACPI: RSDP 000fd160 00014 (v00 DELL  )
 [0.00] ACPI: RSDT 000fd174 00038 (v01 DELL   PE850 0001 MSFT 010A)
 [0.00] ACPI: FACP 000fd1b8 00074 (v01 DELL   PE850 0001 MSFT 010A)
 [0.00] ACPI: DSDT effc 01C19 (v01 DELL   PE830 0001 MSFT 010E)
 [0.00] ACPI: FACS effcfc00 00040
 [0.00] ACPI: APIC 000fd22c 00074 (v01 DELL   PE850 0001 MSFT 010A)

Re: [Ocfs2-users] OCFS2 cluster won't come up and stay up

2011-12-02 Tony Rios
Sunil, in the interest of getting everything back online, I powered down every 
single node.

I powered up one of the nodes that seemed to be able to mount the filesystem
and ran fsck on the filesystem before allowing it to be mounted.
It complained that some of the nodes had not unmounted cleanly, but then set the 
clean flag after a couple of seconds.
I re-ran fsck once more and it came up clean with no warnings or errors.
I then mounted the filesystem on this server and it didn't complain at all.
I am now bringing the servers online one at a time; so far the first four 
have not complained at all.
So we are back up and running, but hopefully the logs can still provide some 
useful information.

Tony
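
The recovery above amounts to a staged bring-up: with every node down, check the volume once from a single node, mount it there, then bring the remaining nodes up one at a time and stop at the first one that misbehaves. fsck.ocfs2 should only be run while no node has the volume mounted, which is why the check comes first. A minimal Python sketch of that loop, assuming you supply the two hooks yourself: mount_on and stays_healthy are hypothetical (for example, wrappers around ssh), and the device path and mount point are placeholders.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def fsck_command(device: str) -> List[str]:
    # fsck.ocfs2 -f forces a full check even if the volume is marked clean;
    # -y answers "yes" to every repair prompt. Run it only with the volume
    # unmounted on every node.
    return ["fsck.ocfs2", "-fy", device]


def mount_command(device: str, mountpoint: str) -> List[str]:
    return ["mount", "-t", "ocfs2", device, mountpoint]


def staged_bring_up(
    nodes: Sequence[str],
    mount_on: Callable[[str], bool],
    stays_healthy: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Mount the volume on one node at a time, stopping at the first failure.

    mount_on(node): hypothetical hook that mounts the volume on `node`,
        returning True on success.
    stays_healthy(node): hypothetical hook that re-checks the cluster after a
        settling period, returning True if nothing has dropped out.
    Returns (nodes brought up cleanly, first failing node or None).
    """
    online: List[str] = []
    for node in nodes:
        # Short-circuits: the health check only runs if the mount succeeded.
        if not mount_on(node) or not stays_healthy(node):
            return online, node
        online.append(node)
    return online, None


if __name__ == "__main__":
    device, mountpoint = "/dev/sdX", "/mnt/ocfs2"  # placeholders
    print("on the first node only:", " ".join(fsck_command(device)))
    print("then on each node in turn:", " ".join(mount_command(device, mountpoint)))
```

Stopping at the first failure, rather than carrying on through the rest of the list, keeps the nodes that are already up in a known-good state while the failing one is investigated, which is the point of bringing servers up one at a time.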

Re: [Ocfs2-users] OCFS2 cluster won't come up and stay up

2011-12-01 Sunil Mushran
To analyze this, one needs the logs. And a bugzilla is a good placeholder for the 
logs.
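
Once /var/log/messages has been collected from every node, it can help to read the cluster-related lines from all of them as one timeline, so that, for example, node 3's attempt to take the superblock cluster lock can be lined up with what the other nodes logged at the same moment. The Python sketch below does that with heapq.merge. Its keyword list is a heuristic assumption, not an official list of OCFS2 message prefixes; it also assumes the classic "Mon DD HH:MM:SS" syslog prefix in an English/C locale, and the merged order is only as trustworthy as the nodes' clock synchronization.

```python
import heapq
import re
import sys
from contextlib import ExitStack
from datetime import datetime
from typing import Dict, Iterable, Iterator, List, Tuple

# Heuristic, not exhaustive: markers for OCFS2 and its o2cb stack (o2net, o2hb, o2dlm).
OCFS2_PATTERN = re.compile(r"ocfs2|o2net|o2hb|o2cb|o2dlm|\bdlm", re.IGNORECASE)

Event = Tuple[datetime, str, str]  # (timestamp, node, raw log line)


def parse_syslog_time(line: str, year: int) -> datetime:
    """Parse a 'Dec  1 16:55:02' prefix; syslog omits the year, so the caller supplies it."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def ocfs2_events(node: str, lines: Iterable[str], year: int) -> Iterator[Event]:
    """Yield the matching lines of one node's log, in file order."""
    for line in lines:
        if not OCFS2_PATTERN.search(line):
            continue
        try:
            stamp = parse_syslog_time(line, year)
        except ValueError:  # no recognizable timestamp prefix; skip the line
            continue
        yield stamp, node, line.rstrip("\n")


def merged_timeline(logs: Dict[str, Iterable[str]], year: int) -> List[Event]:
    """Interleave per-node events by time; each input is assumed already time-ordered."""
    streams = [ocfs2_events(node, lines, year) for node, lines in logs.items()]
    return list(heapq.merge(*streams, key=lambda event: event[0]))


if __name__ == "__main__" and len(sys.argv) > 2:
    # Usage (file names are examples): python timeline.py 2011 node1=messages.node1 node3=messages.node3
    year = int(sys.argv[1])
    with ExitStack() as stack:
        logs = {}
        for arg in sys.argv[2:]:
            node, _, path = arg.partition("=")
            logs[node] = stack.enter_context(open(path, errors="replace"))
        for _stamp, node, line in merged_timeline(logs, year):
            print(f"{node:>8}  {line}")
```

The full, unfiltered files are still what belongs in the bugzilla; the merged view is only a reading aid.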

Re: [Ocfs2-users] OCFS2 cluster won't come up and stay up

2011-12-01 Tony Rios
Fair enough, will do :-)
