Hi, Josh.

Did you have any feedback from the customer regarding the test kernel?

How do you want to proceed with that?

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux-azure in Ubuntu.
https://bugs.launchpad.net/bugs/1802021

Title:
  [Hyper-V] srcu: Lock srcu_data structure in srcu_gp_start()

Status in linux-azure package in Ubuntu:
  Triaged

Bug description:
  We had a customer seeing traces like the following:

  tack trace from kern.log:
  2018-10-10T04:43:08.542464+00:00 hbp2ann-2 kernel: INFO: task 
kworker/u16:0:16678 blocked for more than 120 seconds.
  2018-10-10T04:43:08.542503+00:00 hbp2ann-2 kernel: Not tainted 
4.15.0-1023-azure #24~16.04.1-Ubuntu
  2018-10-10T04:43:08.542513+00:00 hbp2ann-2 kernel: "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
  2018-10-10T04:43:08.547366+00:00 hbp2ann-2 kernel: kworker/u16:0 D 0 16678 2 
0x80000000
  2018-10-10T04:43:08.547386+00:00 hbp2ann-2 kernel: Workqueue: events_unbound 
fsnotify_mark_destroy_workfn
  2018-10-10T04:43:08.547395+00:00 hbp2ann-2 kernel: Call Trace:
  2018-10-10T04:43:08.547413+00:00 hbp2ann-2 kernel: __schedule+0x3d6/0x8b0
  2018-10-10T04:43:08.547422+00:00 hbp2ann-2 kernel: ? 
check_preempt_wakeup+0xfb/0x240
  2018-10-10T04:43:08.547431+00:00 hbp2ann-2 kernel: ? 
sched_clock_local+0x17/0x90
  2018-10-10T04:43:08.547440+00:00 hbp2ann-2 kernel: schedule+0x36/0x80
  2018-10-10T04:43:08.547448+00:00 hbp2ann-2 kernel: 
schedule_timeout+0x1db/0x370
  2018-10-10T04:43:08.547458+00:00 hbp2ann-2 kernel: ? 
__enqueue_entity+0x5c/0x60
  2018-10-10T04:43:08.547467+00:00 hbp2ann-2 kernel: ? 
enqueue_entity+0x112/0x670
  2018-10-10T04:43:08.547477+00:00 hbp2ann-2 kernel: 
wait_for_completion+0xb4/0x140
  2018-10-10T04:43:08.547486+00:00 hbp2ann-2 kernel: ? wake_up_q+0x70/0x70
  2018-10-10T04:43:08.547510+00:00 hbp2ann-2 kernel: 
__synchronize_srcu.part.13+0x85/0xb0
  2018-10-10T04:43:08.547535+00:00 hbp2ann-2 kernel: ? 
trace_raw_output_rcu_utilization+0x50/0x50
  2018-10-10T04:43:08.547560+00:00 hbp2ann-2 kernel: synchronize_srcu+0xd3/0xe0
  2018-10-10T04:43:08.547594+00:00 hbp2ann-2 kernel: ? 
synchronize_srcu+0xd3/0xe0
  2018-10-10T04:43:08.547604+00:00 hbp2ann-2 kernel: 
fsnotify_mark_destroy_workfn+0x7c/0xe0
  2018-10-10T04:43:08.547612+00:00 hbp2ann-2 kernel: 
process_one_work+0x14d/0x410
  2018-10-10T04:43:08.547620+00:00 hbp2ann-2 kernel: worker_thread+0x4b/0x460
  2018-10-10T04:43:08.547628+00:00 hbp2ann-2 kernel: kthread+0x105/0x140
  2018-10-10T04:43:08.547637+00:00 hbp2ann-2 kernel: ? 
process_one_work+0x410/0x410
  2018-10-10T04:43:08.547645+00:00 hbp2ann-2 kernel: ? 
kthread_destroy_worker+0x50/0x50
  2018-10-10T04:43:08.547654+00:00 hbp2ann-2 kernel: ? do_syscall_64+0x73/0x130
  2018-10-10T04:43:08.547677+00:00 hbp2ann-2 kernel: ? SyS_exit_group+0x14/0x20
  2018-10-10T04:43:08.547685+00:00 hbp2ann-2 kernel: ret_from_fork+0x35/0x40

  Error Code: INFO: task kworker/u16:0:16678 blocked for more than 120
  seconds.

  We are seeing more issue with fsnotify related callbacks. These are
  not a soft/hard lockup but seem to significantly degrade the
  responsiveness of systemd (and from there everything else).

  The following upstream commit may fix this issue, but it is in Paul's
  RCU tree and not in linux-next or upstream yet:

  https://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-
  rcu.git/commit/?h=dev&id=1a05c0cd2fee234a10362cc8f66057557cbb291f

  srcu: Lock srcu_data structure in srcu_gp_start()
  The srcu_gp_start() function is called with the srcu_struct structure's
  ->lock held, but not with the srcu_data structure's ->lock.  This is
  problematic because this function accesses and updates the srcu_data
  structure's ->srcu_cblist, which is protected by that lock.  Failing to
  hold this lock can result in corruption of the SRCU callback lists,
  which in turn can result in arbitrarily bad results.

  This commit therefore makes srcu_gp_start() acquire the srcu_data
  structure's ->lock across the calls to rcu_segcblist_advance() and
  rcu_segcblist_accelerate(), thus preventing this corruption.

  Please investigate this issue and evaluate the proposed fix.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1802021/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

Reply via email to