It is having problems doing I/Os to the virtual devices. -5 is EIO.

ke...@utahsysadmin.com wrote:
> I have a relatively new test environment that is set up a little
> differently from your typical scenario. This is my first time using
> OCFS2, but I believe it should work the way I have it set up.
>
> All of this is set up on VMware virtual hosts. I have two front-end
> web servers and one backend administrative server. They all share two
> virtual hard drives within VMware (independent, persistent, and thick
> provisioned).
>
> Everything works great, exactly the way I want, except that
> occasionally one of the nodes crashes with errors like the following:
>
> end_request: I/O error, dev sdc, sector 585159
> Aborting journal on device sdc1
> end_request: I/O error, dev sdc, sector 528151
> Buffer I/O error on device sdc1, logical block 66011
> lost page write due to I/O error on sdc1
> (2848,1):ocfs2_start_trans:240 ERROR: status = -30
> OCFS2: abort (device sdc1): ocfs2_start_trans: Detected aborted journal
> Kernel panic - not syncing: OCFS2: (device sdc1): panic forced after error
>
> <0>Rebooting in 30 seconds..BUG: warning at
> arch/i386/kernel/smp.c:492/smp_send_reschedule() (Tainted: G )
>
> The server never reboots; it just sits there until I reset it. The
> cluster ran fine without errors for a week or two, and now that I have
> upgraded to the latest kernel/ocfs2 it is happening almost daily. The
> disks appear fine: they are on a LUN on a SAN with no reported
> problems, and I unmounted all the partitions and ran fsck.ocfs2 -f on
> both drives from all three nodes (one at a time); it found no errors.
>
> This morning it happened again, and now after a reset the server will
> not boot at all; it just sits on "Starting Oracle Cluster File System
> (OCFS2)". These servers are all running OEL 5.4 with the latest
> patches installed.
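To expand on the note at the top: the kernel logs negated errno values, so
the "status = -30" above is EROFS (the journal aborted and the filesystem
went read-only), and the "status = -5" entries in the dmesg output further
down are EIO. A minimal sanity check of those mappings, assuming Python is
available on the node:

# python -c 'import errno, os; print errno.EIO, os.strerror(errno.EIO)'
5 Input/output error
# python -c 'import errno, os; print errno.EROFS, os.strerror(errno.EROFS)'
30 Read-only file system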
>
> Here's the setup:
>
> # cat /etc/ocfs2/cluster.conf
> cluster:
>         node_count = 3
>         name = qacluster
>
> node:
>         ip_port = 7777
>         ip_address = 10.10.220.30
>         number = 0
>         name = qa-admin
>         cluster = qacluster
>
> node:
>         ip_port = 7777
>         ip_address = 10.10.220.31
>         number = 1
>         name = qa-web1
>         cluster = qacluster
>
> node:
>         ip_port = 7777
>         ip_address = 10.10.220.32
>         number = 2
>         name = qa-web2
>         cluster = qacluster
>
> # mounted.ocfs2 -d
> Device     FS     UUID                                  Label
> /dev/sdb1  ocfs2  85b050a0-a381-49d8-8353-c21b1c8b28c4  data
> /dev/sdc1  ocfs2  6a03e81a-8186-41a6-8fd8-dc23854e12d3  logs
>
> # uname -a
> Linux qa-admin.domain.com 2.6.18-164.15.1.0.1.el5 #1 SMP Wed Mar 17
> 00:56:05 EDT 2010 i686 i686 i386 GNU/Linux
>
> # rpm -qa | grep ocfs2
> ocfs2-2.6.18-164.11.1.0.1.el5-1.4.4-1.el5
> ocfs2-tools-1.4.3-1.el5
> ocfs2-2.6.18-164.15.1.0.1.el5-1.4.4-1.el5
>
> This is the latest from one of the surviving hosts:
>
> # dmesg | tail -50
> (2869,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2869,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2869,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2869,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2869,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2869,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2869,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2869,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor
> # 1128960 has bit count 32256 but claims that 34300 are free
> (2881,0):ocfs2_search_chain:1244 ERROR: status = -5
> (2881,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
> (2881,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
> (2881,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
> (2881,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
> (2881,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
> (2881,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
> (2881,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2881,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2881,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2881,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2881,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2881,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2881,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2881,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> (2045,0):o2net_connect_expired:1664 ERROR: no connection established with
> node 2 after 30.0 seconds, giving up and returning errors.
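That o2net_connect_expired timeout is a separate problem from the I/O
errors and is worth checking on its own. A quick way to verify the
interconnect from the node that logged it (a sketch; the address and port
come from the cluster.conf above, and this assumes nc is installed):

# nc -z -w 5 10.10.220.32 7777 && echo "o2net port on qa-web2 is reachable"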
> OCFS2: ERROR (device sdc1): ocfs2_check_group_descriptor: Group descriptor
> # 1128960 has bit count 32256 but claims that 34300 are free
> (2872,0):ocfs2_search_chain:1244 ERROR: status = -5
> (2872,0):ocfs2_claim_suballoc_bits:1433 ERROR: status = -5
> (2872,0):__ocfs2_claim_clusters:1715 ERROR: status = -5
> (2872,0):ocfs2_local_alloc_new_window:1013 ERROR: status = -5
> (2872,0):ocfs2_local_alloc_slide_window:1116 ERROR: status = -5
> (2872,0):ocfs2_reserve_local_alloc_bits:537 ERROR: status = -5
> (2872,0):__ocfs2_reserve_clusters:725 ERROR: status = -5
> (2872,0):ocfs2_lock_allocators:677 ERROR: status = -5
> (2872,0):__ocfs2_extend_allocation:739 ERROR: status = -5
> (2872,0):ocfs2_extend_no_holes:952 ERROR: status = -5
> (2872,0):ocfs2_expand_nonsparse_inode:1678 ERROR: status = -5
> (2872,0):ocfs2_write_begin_nolock:1722 ERROR: status = -5
> (2872,0):ocfs2_write_begin:1860 ERROR: status = -5
> (2872,0):ocfs2_file_buffered_write:2039 ERROR: status = -5
> (2872,0):__ocfs2_file_aio_write:2194 ERROR: status = -5
> (2065,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
> (12701,1):dlm_get_lock_resource:844
> 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least
> one node (2) to recover before lock mastery can begin
> (2045,0):ocfs2_dlm_eviction_cb:98 device (8,33): dlm has evicted node 2
> (12701,1):dlm_get_lock_resource:898
> 6A03E81A818641A68FD8DC23854E12D3:M00000000000000000000243568d3c5: at least
> one node (2) to recover before lock mastery can begin
> o2net: accepted connection from node qa-web2 (num 2) at 147.178.220.32:7777
> ocfs2_dlm: Node 2 joins domain 6A03E81A818641A68FD8DC23854E12D3
> ocfs2_dlm: Nodes in domain ("6A03E81A818641A68FD8DC23854E12D3"): 0 1 2
> (12701,1):dlm_restart_lock_mastery:1216 node 2 up while restarting
> (12701,1):dlm_wait_for_lock_mastery:1040 ERROR: status = -11
>
> Any suggestions? Is there any more data I can provide?
>
> Thanks for any help.
>
> Kevin
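The ocfs2_check_group_descriptor error above says the allocator metadata
on sdc1 is inconsistent on disk, which is what turns into EIO for every
write that needs to allocate space. One way to inspect the descriptor the
kernel is flagging and then let fsck repair it (a sketch; the block number
is taken from the log above, and the volume must be unmounted on all three
nodes first):

# debugfs.ocfs2 -R "group 1128960" /dev/sdc1
# fsck.ocfs2 -fy /dev/sdc1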
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users