Thanks for the thorough response!

> This looks like an error in the kernel. You should consult kernel/rpcrdma
> folks with this error.

Mind pointing me in the direction of how to get in contact with the
appropriate mailing list to begin a dialog with the kernel/rpcrdma folks?
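For what it's worth, the kernel source tree itself can point at the right
addresses. A minimal sketch, assuming a checked-out kernel tree (the
RPC-over-RDMA client and server code lives under net/sunrpc/xprtrdma):

  # Print the maintainers and mailing lists responsible for the
  # rpcrdma/svcrdma code that is logging these errors.
  ./scripts/get_maintainer.pl -f net/sunrpc/xprtrdma/

  # linux-rdma@vger.kernel.org and linux-nfs@vger.kernel.org are the
  # lists it typically suggests for this area.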
Thanks
- Trey

On Wed, Feb 12, 2014 at 3:04 AM, Nir Soffer <nsof...@redhat.com> wrote:
> ----- Original Message -----
>> From: "Trey Dockendorf" <treyd...@gmail.com>
>> To: "Itamar Heim" <ih...@redhat.com>
>> Cc: "users" <users@ovirt.org>
>> Sent: Monday, February 10, 2014 3:03:05 AM
>> Subject: Re: [Users] Host Non-Operational from sanlock and VM fails to migrate
>>
>> No, in fact I just had the issue arise again after trying to figure out
>> what about my setup causes this crash. So far it only seems to occur if
>> both nodes are running NFS over RDMA, but I'm unsure if it's VM traffic
>> or the host being SPM that causes it to misbehave.
>>
>> vm02 was running a single VM and was SPM. The crash was on vm02:
>> "Invalid status on Data Center Default. Setting Data Center status to
>> Non Responsive (On host vm02, Error: Network error during communication
>> with the Host).". SPM successfully switched to vm01, but the VM is stuck
>> in migration and unresponsive. Both engine and nodes are running oVirt 3.3.3.
>>
>> vm01 and vm02 both have the following in vdsm.conf:
>>
>> [addresses]
>> management_port = 54321
>>
>> [vars]
>> ssl = true
>>
>> [irs]
>> nfs_mount_options = rdma,port=20049
>>
>> These are the oVirt NFS mount lines in /proc/mounts on each host:
>>
>> vm01:
>>
>> 192.168.211.245:/tank/ovirt/import_export /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
>> 192.168.211.245:/tank/ovirt/iso /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
>> 192.168.211.245:/tank/ovirt/data /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
>>
>> vm02:
>>
>> 192.168.211.245:/tank/ovirt/import_export /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_import__export nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
>> 192.168.211.245:/tank/ovirt/iso /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_iso nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
>> 192.168.211.245:/tank/ovirt/data /rhev/data-center/mnt/192.168.211.245:_tank_ovirt_data nfs rw,relatime,vers=3,rsize=32768,wsize=32768,namlen=255,hard,proto=rdma,port=20049,timeo=600,retrans=2,sec=sys,mountaddr=192.168.211.245,mountvers=3,mountproto=tcp,local_lock=none,addr=192.168.211.245 0 0
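As a side note, a quick way to confirm on each host that these mounts really
negotiated the RDMA transport (and have not silently fallen back to TCP) is
to look at the live mount options; a minimal sketch, run on vm01 and vm02:

  # Show the effective NFS mount options as the kernel sees them.
  nfsstat -m

  # Or count the oVirt mounts that carry proto=rdma (should be 3 here).
  grep 'rhev/data-center' /proc/mounts | grep -c 'proto=rdma'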
>> The NFS server had these 2 log entries in /var/log/messages around the
>> time vm02 went Non-operational:
>>
>> Feb 9 17:27:59 vmstore1 kernel: svcrdma: Error fast registering memory for xprt ffff882014683400
>> Feb 9 17:28:21 vmstore1 kernel: svcrdma: Error fast registering memory for xprt ffff882025bf1400
>
> This looks like the root cause - failure on the storage server.
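It may also be worth watching the server side while this happens; a minimal
sketch for vmstore1, assuming the usual EL6-style NFS/RDMA server setup where
the RDMA listener is registered through the nfsd portlist:

  # The RDMA listener on 20049 should still appear here after the errors.
  cat /proc/fs/nfsd/portlist

  # Watch for further svcrdma fast-registration failures as they occur.
  tail -f /var/log/messages | grep -i svcrdma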
> This leads to failure in the hosts connected to this storage:
>
> Feb 2 13:37:11 vm01 kernel: rpcrdma: connection to 192.168.211.245:20049 closed (-103)
> ...
> Feb 3 07:44:31 vm01 kernel: ------------[ cut here ]------------
> Feb 3 07:44:31 vm01 kernel: WARNING: at kernel/softirq.c:159 local_bh_enable_ip+0x7d/0xb0() (Not tainted)
> Feb 3 07:44:31 vm01 kernel: Hardware name: H8DMT-IBX
> Feb 3 07:44:31 vm01 kernel: Modules linked in: ebt_arp xprtrdma nfs fscache auth_rpcgss nfs_acl bonding ebtable_nat ebtables softdog lockd sunrpc powernow_k8 freq_table mperf 8021q garp bridge stp llc xt_physdev ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 xt_multiport iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 dm_round_robin dm_multipath vhost_net macvtap macvlan tun kvm_amd kvm microcode serio_raw k10temp amd64_edac_mod edac_core edac_mce_amd igb dca i2c_algo_bit ptp pps_core mlx4_ib ib_sa ib_mad ib_core mlx4_en mlx4_core sg i2c_nforce2 i2c_core ext4 jbd2 mbcache raid1 sd_mod crc_t10dif sata_nv ata_generic pata_acpi pata_amd dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> Feb 3 07:44:31 vm01 kernel: Pid: 0, comm: swapper Not tainted 2.6.32-431.3.1.el6.x86_64 #1
> Feb 3 07:44:31 vm01 kernel: Call Trace:
> Feb 3 07:44:31 vm01 kernel: <IRQ> [<ffffffff81071e27>] ? warn_slowpath_common+0x87/0xc0
> Feb 3 07:44:31 vm01 kernel: [<ffffffff81071e7a>] ? warn_slowpath_null+0x1a/0x20
> Feb 3 07:44:31 vm01 kernel: [<ffffffff8107a3ed>] ? local_bh_enable_ip+0x7d/0xb0
> Feb 3 07:44:31 vm01 kernel: [<ffffffff8152a4fb>] ? _spin_unlock_bh+0x1b/0x20
> Feb 3 07:44:31 vm01 kernel: [<ffffffffa044c4f0>] ? rpc_wake_up_status+0x70/0x80 [sunrpc]
> Feb 3 07:44:31 vm01 kernel: [<ffffffffa044579c>] ? xprt_wake_pending_tasks+0x2c/0x30 [sunrpc]
> Feb 3 07:44:31 vm01 kernel: [<ffffffffa02782fc>] ? rpcrdma_conn_func+0x9c/0xb0 [xprtrdma]
> Feb 3 07:44:31 vm01 kernel: [<ffffffff81065df0>] ? default_wake_function+0x0/0x20
> Feb 3 07:44:31 vm01 kernel: [<ffffffffa027b450>] ? rpcrdma_qp_async_error_upcall+0x40/0x80 [xprtrdma]
> Feb 3 07:44:31 vm01 kernel: [<ffffffffa019a1cb>] ? mlx4_ib_qp_event+0x8b/0x100 [mlx4_ib]
> Feb 3 07:44:31 vm01 kernel: [<ffffffffa0138c54>] ? mlx4_qp_event+0x74/0xf0 [mlx4_core]
> Feb 3 07:44:31 vm01 kernel: [<ffffffffa01bd396>] ? igb_poll+0xb66/0x1020 [igb]
> Feb 3 07:44:31 vm01 kernel: [<ffffffffa0126057>] ? mlx4_eq_int+0x557/0xcb0 [mlx4_core]
> Feb 3 07:44:31 vm01 kernel: [<ffffffff8109a9a0>] ? posix_timer_fn+0x0/0xe0
> Feb 3 07:44:31 vm01 kernel: [<ffffffff8109a982>] ? posix_timer_event+0x42/0x60
> Feb 3 07:44:31 vm01 kernel: [<ffffffff810a7159>] ? ktime_get+0x69/0xf0
> Feb 3 07:44:31 vm01 kernel: [<ffffffffa01267c4>] ? mlx4_msi_x_interrupt+0x14/0x20 [mlx4_core]
> Feb 3 07:44:31 vm01 kernel: [<ffffffff810e6ed0>] ? handle_IRQ_event+0x60/0x170
> Feb 3 07:44:31 vm01 kernel: [<ffffffff810e982e>] ? handle_edge_irq+0xde/0x180
> Feb 3 07:44:31 vm01 kernel: [<ffffffff8100faf9>] ? handle_irq+0x49/0xa0
> Feb 3 07:44:31 vm01 kernel: [<ffffffff81530fec>] ? do_IRQ+0x6c/0xf0
> Feb 3 07:44:31 vm01 kernel: [<ffffffff8100b9d3>] ? ret_from_intr+0x0/0x11
> Feb 3 07:44:31 vm01 kernel: <EOI> [<ffffffff8103eacb>] ? native_safe_halt+0xb/0x10
> Feb 3 07:44:31 vm01 kernel: [<ffffffff810167bd>] ? default_idle+0x4d/0xb0
> Feb 3 07:44:31 vm01 kernel: [<ffffffff810168bd>] ? c1e_idle+0x9d/0x120
> Feb 3 07:44:31 vm01 kernel: [<ffffffff81009fc6>] ? cpu_idle+0xb6/0x110
> Feb 3 07:44:31 vm01 kernel: [<ffffffff81520e2c>] ? start_secondary+0x2ac/0x2ef
> Feb 3 07:44:31 vm01 kernel: ---[ end trace 9d97672873a67a1f ]---
>
> This looks like an error in the kernel. You should consult kernel/rpcrdma
> folks with this error.
>
> Which causes sanlock to fail to update the lease on the storage (expected):
>
> Feb 3 07:44:58 vm01 sanlock[2536]: 2014-02-03 07:44:58-0600 240689 [2536]: s1 check_our_lease failed 80
>
> Then sanlock tries to kill vdsm, the owner of the lease:
>
> Feb 3 07:44:58 vm01 sanlock[2536]: 2014-02-03 07:44:58-0600 240689 [2536]: s1 kill 3030 sig 15 count 1
> ...
> Feb 3 07:45:06 vm01 sanlock[2536]: 2014-02-03 07:45:06-0600 240698 [2536]: s1 kill 3030 sig 15 count 10
> Feb 3 07:45:07 vm01 sanlock[2536]: 2014-02-03 07:45:07-0600 240698 [2536]: dead 3030 ci 3 count 10
>
> This makes the host Non-Responsive (expected).
>
> Now vdsm is restarted, which will make it responsive again:
>
> Feb 3 07:45:07 vm01 respawn: slave '/usr/share/vdsm/vdsm --pidfile /var/run/vdsm/vdsmd.pid' died, respawning slave
>
> But since there is no access to storage, the host is Non Operational (expected).
>
> The vm was starting a migration to the other host:
>
> Thread-26::DEBUG::2014-02-03 07:49:18,067::BindingXMLRPC::965::vds::(wrapper) client [192.168.202.99]::call vmMigrate with ({'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'},) {} flowID [7829ae2a]
> Thread-26::DEBUG::2014-02-03 07:49:18,067::API::463::vds::(migrate) {'tunneled': 'false', 'dstqemu': '192.168.202.103', 'src': 'vm01.brazos.tamu.edu', 'dst': 'vm02.brazos.tamu.edu:54321', 'vmId': '741f9811-db68-4dc4-a88a-7cb9be576e57', 'method': 'online'}
> Thread-26::DEBUG::2014-02-03 07:49:18,068::BindingXMLRPC::972::vds::(wrapper) return vmMigrate with {'status': {'message': 'Migration in progress', 'code': 0}, 'progress': 0}
>
> The migration was almost complete after 20 seconds:
>
> Thread-29::INFO::2014-02-03 07:49:38,329::vm::815::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration Progress: 20 seconds elapsed, 99% of data processed, 99% of mem processed
>
> But it never completed:
>
> Thread-29::WARNING::2014-02-03 07:54:38,383::vm::792::vm.Vm::(run) vmId=`741f9811-db68-4dc4-a88a-7cb9be576e57`::Migration is stuck: Hasn't progressed in 300.054134846 seconds. Aborting.
>
> CCing Michal to inspect why the migration has failed.
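While a migration is wedged like this, it can also help to poll the source
host directly; a minimal sketch (the UUID is the one from the logs above, and
the exact vdsClient invocation can vary slightly between vdsm versions):

  # Has sanlock's lockspace on the storage domain recovered?
  sanlock client status

  # Ask vdsm on vm01 for the VM's live stats (these include migration
  # progress), and check libvirt's read-only view of whether the guest is paused.
  vdsClient -s 0 getVmStats 741f9811-db68-4dc4-a88a-7cb9be576e57
  virsh -r list --all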
>> Attached is a tar of the logs from vm01, vm02 and the engine server.
>>
>> The vm01 & vm02 folders contain files from '/var/log/messages
>> /var/log/sanlock.log /var/log/vdsm/*.log', and engine from
>> '/var/log/messages /var/log/ovirt-engine/*.log'.
>>
>> Thanks
>> - Trey
>>
>> On Sun, Feb 9, 2014 at 4:15 PM, Itamar Heim <ih...@redhat.com> wrote:
>> > On 02/03/2014 06:58 PM, Trey Dockendorf wrote:
>> >>
>> >> I have a 2 node oVirt 3.3.2 cluster setup and am evaluating the setup
>> >> for production use on our HPC system for managing our VM
>> >> infrastructure. Currently I'm trying to utilize our DDR InfiniBand
>> >> fabric for the storage domains in oVirt using NFS over RDMA. I've
>> >> noticed some unstable behavior, and it seems in every case to begin
>> >> with sanlock.
>> >>
>> >> The oVirt web admin interface shows the following message as the first
>> >> sign of trouble on 2014-Feb-03 07:45:
>> >>
>> >> "Invalid status on Data Center Default. Setting Data Center status to
>> >> Non Responsive (On host vm01.brazos.tamu.edu, Error: Network error
>> >> during communication with the Host.).".
>> >>
>> >> The single VM I had running is stuck in the "Migrating From" state.
>> >> virsh shows the VM paused on the crashed host and on the one it
>> >> attempted to migrate to.
>> >>
>> >> Right now I have a few concerns.
>> >>
>> >> 1) The cause of the sanlock (or other instability) and whether it's
>> >> related to a bug or an issue using NFSoRDMA.
>
> vdsm and sanlock seem to behave as they should when storage is not accessible.
>
>> >> 2) Why the VM failed to migrate if the second host had no issues.
>
> Virt team will have to answer this.
>
>> >> If the first host is down, should the VM be considered offline and
>> >> booted on the second host after the first is fenced?
>
> The host was not fenced, and was not down. It was up and the vm was still
> running, possibly accessing the storage.
>
>> >> Attached are logs from the failed host (vm01) and the healthy host
>> >> (vm02) as well as engine. The failed host's /var/log/messages is also
>> >> attached (vm01_message.log).
>> >>
>> >> Thanks
>> >> - Trey
>> >
>> > was this resolved?

_______________________________________________
Users mailing list
Users@ovirt.org
http://lists.ovirt.org/mailman/listinfo/users