Greetings folks, I am experiencing an issue with openvswitch-2.0.0, and I believe there is a deadlock in it.
Steps to reproduce the issue:

1. Install openvswitch-2.0.0 on a CentOS machine with kernel version 3.0.57.
2. Start up vswitchd and ovsdb.
3. Create a bridge and add a vxlan port:
   # ovs-vsctl add-br ovsbr1
   # ovs-vsctl add-port ovsbr1 eth1
   # ovs-vsctl add-port ovsbr1 vxlan2 -- set interface vxlan2 type=vxlan options:local_ip=11.12.13.4 options:remote_ip=flow
4. Stop vswitchd and ovsdb.
5. rmmod openvswitch.

The rmmod command hangs and never returns. After 120 seconds the system log shows the following:

Mar 19 01:51:53 NODE-4 kernel: [49343.232348] ovs_workq D ffff88007f052000 0 7558 2 0x00000000
Mar 19 01:51:53 NODE-4 kernel: [49343.232355] ffff880044397d60 0000000000000246 ffff88004417e040 0000000000012000
Mar 19 01:51:53 NODE-4 kernel: [49343.232362] ffff880044397fd8 ffff880000000000 ffff88004417e040 0000000000012000
Mar 19 01:51:53 NODE-4 kernel: [49343.232369] ffff880044397fd8 ffff880044396010 ffff880044397fd8 0000000000012000
Mar 19 01:51:53 NODE-4 kernel: [49343.232377] Call Trace:
Mar 19 01:51:53 NODE-4 kernel: [49343.232386] [<ffffffff81974b4b>] ? xen_hypervisor_callback+0x1b/0x20
Mar 19 01:51:53 NODE-4 kernel: [49343.232390] [<ffffffff8197378a>] ? error_exit+0x2a/0x60
Mar 19 01:51:53 NODE-4 kernel: [49343.232393] [<ffffffff819732e1>] ? retint_restore_args+0x5/0x6
Mar 19 01:51:53 NODE-4 kernel: [49343.232395] [<ffffffff81972da9>] ? _raw_spin_lock+0x9/0x10
Mar 19 01:51:53 NODE-4 kernel: [49343.232401] [<ffffffff810e4dc6>] ? kmem_cache_free+0xc6/0x1a0
Mar 19 01:51:53 NODE-4 kernel: [49343.232404] [<ffffffff8197108a>] schedule+0x3a/0x50
Mar 19 01:51:53 NODE-4 kernel: [49343.232407] [<ffffffff8197197f>] __mutex_lock_slowpath+0xdf/0x160
Mar 19 01:51:53 NODE-4 kernel: [49343.232413] [<ffffffffa00636b0>] ? vxlan_udp_encap_recv+0x160/0x160 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232416] [<ffffffff819717ce>] mutex_lock+0x1e/0x40
Mar 19 01:51:53 NODE-4 kernel: [49343.232420] [<ffffffff81851378>] unregister_pernet_device+0x18/0x50
Mar 19 01:51:53 NODE-4 kernel: [49343.232424] [<ffffffffa0063704>] vxlan_del_work+0x54/0x70 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232429] [<ffffffffa0063e01>] worker_thread+0xe1/0x1d0 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232433] [<ffffffff8106f020>] ? wake_up_bit+0x40/0x40
Mar 19 01:51:53 NODE-4 kernel: [49343.232438] [<ffffffffa0063d20>] ? ovs_workqueues_exit+0x30/0x30 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232440] [<ffffffff8106eb66>] kthread+0x96/0xa0
Mar 19 01:51:53 NODE-4 kernel: [49343.232443] [<ffffffff81974a24>] kernel_thread_helper+0x4/0x10
Mar 19 01:51:53 NODE-4 kernel: [49343.232446] [<ffffffff81973b36>] ? int_ret_from_sys_call+0x7/0x1b
Mar 19 01:51:53 NODE-4 kernel: [49343.232448] [<ffffffff819732e1>] ? retint_restore_args+0x5/0x6
Mar 19 01:51:53 NODE-4 kernel: [49343.232451] [<ffffffff81974a20>] ? gs_change+0x13/0x13
Mar 19 01:51:53 NODE-4 kernel: [49343.232453] INFO: task rmmod:8479 blocked for more than 120 seconds.
Mar 19 01:51:53 NODE-4 kernel: [49343.232455] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 19 01:51:53 NODE-4 kernel: [49343.232456] rmmod D ffff88007f1d2000 0 8479 1095 0x00000000
Mar 19 01:51:53 NODE-4 kernel: [49343.232459] ffff880049c5bc18 0000000000000282 ffffffff32323532 ffffffff81d6730a
Mar 19 01:51:53 NODE-4 kernel: [49343.232462] ffff880049c5bb78 ffff880000000000 ffff8800442844c0 0000000000012000
Mar 19 01:51:53 NODE-4 kernel: [49343.232465] ffff880049c5bfd8 ffff880049c5a010 ffff880049c5bfd8 0000000000012000
Mar 19 01:51:53 NODE-4 kernel: [49343.232468] Call Trace:
Mar 19 01:51:53 NODE-4 kernel: [49343.232471] [<ffffffff81972e29>] ? _raw_spin_unlock_irqrestore+0x19/0x20
Mar 19 01:51:53 NODE-4 kernel: [49343.232474] [<ffffffff81095332>] ? irq_to_desc+0x12/0x20
Mar 19 01:51:53 NODE-4 kernel: [49343.232477] [<ffffffff810979d9>] ? irq_get_irq_data+0x9/0x10
Mar 19 01:51:53 NODE-4 kernel: [49343.232480] [<ffffffff8141d669>] ? info_for_irq+0x9/0x20
Mar 19 01:51:53 NODE-4 kernel: [49343.232483] [<ffffffff8197108a>] schedule+0x3a/0x50
Mar 19 01:51:53 NODE-4 kernel: [49343.232486] [<ffffffff819714a5>] schedule_timeout+0x1a5/0x200
Mar 19 01:51:53 NODE-4 kernel: [49343.232490] [<ffffffff81004bcf>] ? xen_pte_val+0x2f/0x80
Mar 19 01:51:53 NODE-4 kernel: [49343.232492] [<ffffffff810e5837>] ? kfree+0x117/0x210
Mar 19 01:51:53 NODE-4 kernel: [49343.232495] [<ffffffff810e5837>] ? kfree+0x117/0x210
Mar 19 01:51:53 NODE-4 kernel: [49343.232497] [<ffffffff81970531>] wait_for_common+0xd1/0x180
Mar 19 01:51:53 NODE-4 kernel: [49343.232501] [<ffffffff813b6e20>] ? kobject_del+0x40/0x40
Mar 19 01:51:53 NODE-4 kernel: [49343.232506] [<ffffffff8104e210>] ? try_to_wake_up+0x260/0x260
Mar 19 01:51:53 NODE-4 kernel: [49343.232509] [<ffffffff81970688>] wait_for_completion+0x18/0x20
Mar 19 01:51:53 NODE-4 kernel: [49343.232513] [<ffffffffa0063f71>] __cancel_work_timer+0x81/0x1b0 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232516] [<ffffffff8104e183>] ? try_to_wake_up+0x1d3/0x260
Mar 19 01:51:53 NODE-4 kernel: [49343.232521] [<ffffffffa00640d0>] ? rpl_cancel_delayed_work_sync+0x10/0x10 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232526] [<ffffffffa00640ab>] cancel_work_sync+0xb/0x20 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232530] [<ffffffffa00643a3>] ovs_exit_net+0x73/0x82 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232533] [<ffffffff818511d9>] ops_exit_list+0x39/0x60
Mar 19 01:51:53 NODE-4 kernel: [49343.232535] [<ffffffff8185132d>] unregister_pernet_operations+0x3d/0x70
Mar 19 01:51:53 NODE-4 kernel: [49343.232538] [<ffffffff81851389>] unregister_pernet_device+0x29/0x50
Mar 19 01:51:53 NODE-4 kernel: [49343.232541] [<ffffffffa00583c8>] dp_cleanup+0x58/0x80 [openvswitch]
Mar 19 01:51:53 NODE-4 kernel: [49343.232546] [<ffffffff810885a8>] sys_delete_module+0x178/0x240

Some investigation of the code:

Thread rmmod:
  dp_cleanup
  --> unregister_netdevice_notifier(&ovs_dp_device_notifier)
      --> dp_device_event
          --> (event == NETDEV_UNREGISTER) queue_work(&ovs_net->dp_notify_work)
              --> ovs_workq will deal with this
  --> unregister_pernet_device(&ovs_net_ops)
      --> acquires net_mutex
      --> unregister_pernet_operations
          --> ops_exit_list
              --> ovs_exit_net
                  --> cancel_work_sync
                      --> __cancel_work_timer
                          --> workqueue_barrier
                              --> queues a barrier work on ovs_workq and waits for it:
                                  wait_for_completion(&barr.done);

Thread ovs_workq:
  worker_thread
  --> run_workqueue
      --> ovs_dp_notify_wq
          --> dp_detach_port_notify
              --> ovs_dp_detach_port
                  --> ovs_vport_del
                      --> vport->ops->destroy(vport)
                          --> vxlan_tnl_destroy
                              --> vxlan_sock_release
                                  --> queue_work(&vs->del_work)
                                      --> queues vxlan_del_work on ovs_workq
  worker_thread
  --> run_workqueue
      --> vxlan_del_work
          --> vxlan_cleanup_module
              --> unregister_pernet_device(&vxlan_net_ops)
                  --> tries to acquire net_mutex

I believe there is a deadlock here. After rmmod is issued, unregister_netdevice_notifier() queues dp_notify_work on ovs_workq, and the thread then proceeds to unregister_pernet_device(&ovs_net_ops). In unregister_pernet_device(), thread rmmod takes net_mutex and runs ovs_exit_net(), which adds a barrier work item to the tail of the ovs_workq workqueue and then waits for that barrier to complete while still holding net_mutex. On the other hand, thread ovs_workq is doing the vxlan cleanup: unregister_pernet_device(&vxlan_net_ops) tries to acquire net_mutex, which is held by rmmod. So vxlan_del_work never finishes, the barrier work is never executed, and rmmod can never continue and release net_mutex.

Can anyone confirm this? And is there a fix for this issue?

Thanks,
Lin
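P.S. To double-check my understanding of the pattern, here is a minimal, self-contained demo module that models it outside of the OVS code. Everything in it (demo_wq, demo_mutex, work_a, work_b) is made up for illustration; it uses the stock kernel workqueue API, with flush_work() standing in for the barrier that the compat __cancel_work_timer() queues and waits on, so it is only a sketch of the pattern, not the actual OVS/vxlan code path:

/*
 * Hypothetical demo module -- NOT the actual OVS code.  It only models the
 * pattern described above with the stock workqueue API:
 *
 *   - work_a plays the role of vxlan_del_work: it runs on a single-threaded
 *     workqueue and blocks in mutex_lock() on a mutex that the exit path
 *     already holds (net_mutex in the real report).
 *   - demo_exit() plays the role of rmmod/ovs_exit_net(): it holds the mutex
 *     and then waits for work_b, which sits behind work_a on the same
 *     workqueue.  flush_work() stands in for the barrier that the compat
 *     __cancel_work_timer() queues and waits on.
 *
 * The single worker can never get past work_a, so the wait in demo_exit()
 * never finishes and rmmod hangs, just like in the trace above.
 */
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;
static DEFINE_MUTEX(demo_mutex);

/* Stands in for vxlan_del_work -> unregister_pernet_device(). */
static void work_a_fn(struct work_struct *work)
{
        mutex_lock(&demo_mutex);        /* blocks: demo_exit() holds the mutex */
        mutex_unlock(&demo_mutex);
}

/* Stands in for dp_notify_work; its body is irrelevant here. */
static void work_b_fn(struct work_struct *work)
{
}

static DECLARE_WORK(work_a, work_a_fn);
static DECLARE_WORK(work_b, work_b_fn);

static int __init demo_init(void)
{
        demo_wq = create_singlethread_workqueue("demo_wq");
        return demo_wq ? 0 : -ENOMEM;
}

static void __exit demo_exit(void)
{
        mutex_lock(&demo_mutex);        /* rmmod's role: the mutex is held... */

        queue_work(demo_wq, &work_a);   /* the worker picks this up and blocks */
        queue_work(demo_wq, &work_b);   /* queued behind work_a */

        flush_work(&work_b);            /* ...while we wait for work_b: deadlock.
                                         * work_b can only run after work_a,
                                         * and work_a waits for demo_mutex. */

        mutex_unlock(&demo_mutex);      /* never reached */
        destroy_workqueue(demo_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Demo: holding a mutex while waiting on work that needs it");

As far as I can tell, rmmod of this demo module should hang the same way as in the report: the demo_wq worker is stuck in mutex_lock(), and rmmod is stuck waiting for work_b, which can never start.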