[Kernel-packages] [Bug 1774366] Re: Fix MCE handling for user access of poisoned device-dax mapping

Thadeu Lima de Souza Cascardo Thu, 30 Aug 2018 07:06:47 -0700

** Also affects: linux (Ubuntu)
   Importance: Undecided
       Status: New


** Also affects: linux (Ubuntu Cosmic)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Cosmic)
       Status: New => Confirmed

** Changed in: linux (Ubuntu Cosmic)
   Importance: Undecided => High

** Changed in: linux (Ubuntu Cosmic)
     Assignee: (unassigned) => Thadeu Lima de Souza Cascardo (cascardo)

** Changed in: linux (Ubuntu Cosmic)
       Status: Confirmed => In Progress

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1774366

Title:
  Fix MCE handling for user access of poisoned device-dax mapping

Status in intel:
  Triaged
Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Cosmic:
  In Progress

Bug description:
  Description:

  Customer reports:

  —

  We've tried the ndctl error injection. Now the error injection is
  successful. But we have a couple of questions related with the
  poisoned block.

  Here are some tests/steps that I did:

  1. I mmap the first 10GB of /dev/dax13.0 to virtual address space and
  inject an error at offset 1GB (block offset should be 1GB/512bytes =
  2097152) which seems fine:

  [root@scaqae09celadm13 wning]# ndctl inject-error --uninject --block=2097152 --count=1 namespace13.0
  Warning: Un-injecting previously injected errors here will
  not cause the kernel to 'forget' its badblock entries. Those
  have to be cleared through the normal process of writing
  the affected blocks

  {
    "dev":"namespace13.0",
    "mode":"dax",
    "size":518967525376,
    "uuid":"0738c8bd-3b3f-4989-9d0e-0e9c6006c810",
    "chardev":"dax13.0",
    "numa_node":0,
    "badblock_count":1,
    "badblocks":[
      { "offset":2097152, "length":1, "dimms":[ "nmem1" ] }
    ]
  }
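
  The block number used in step 1 follows from dividing the byte offset
  by the 512-byte block size. A minimal sketch of that arithmetic; the
  helper name byte_offset_to_block is hypothetical and not part of ndctl:

```c
#include <stdint.h>

/* Hypothetical helper: convert a byte offset into a block number.
 * Returns UINT64_MAX if block_size is 0 or the offset is not
 * block-aligned, since such an offset has no exact block number. */
uint64_t byte_offset_to_block(uint64_t byte_offset, uint64_t block_size)
{
    if (block_size == 0 || byte_offset % block_size != 0)
        return UINT64_MAX;
    return byte_offset / block_size;
}
```

  Calling byte_offset_to_block(1ULL << 30, 512) gives 2097152, the value
  passed to --block and reported in the badblocks list.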

  2. My test program simply tries to read every address in the first
  10GB. The first time I read offset 1GB, I got a SIGBUS, but in the
  siginfo struct passed to the signal handler the failing address is
  NULL and the signal code is 128, which seems incorrect. If we then
  run the unit test again, it gets stuck here:

  rt_sigaction(SIGBUS, {0x400dd2, [], SA_RESTORER|SA_SIGINFO, 0x7fb5cf839270}, NULL, 8) = 0
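
  For reference, a signal code of 128 is SI_KERNEL (0x80) on Linux,
  whereas a consumed memory error is normally reported with si_code
  BUS_MCEERR_AR and the faulting address in si_addr. The sketch below
  shows one way a handler installed with SA_SIGINFO could record and
  name these values. It is not the reporter's handler; the names
  sigbus_code_name, on_sigbus, install_sigbus_handler and the three
  sigbus_* globals are made up for illustration, and glibc on Linux is
  assumed:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* assumption: glibc; exposes BUS_MCEERR_* and SI_KERNEL */
#endif
#include <signal.h>
#include <string.h>

/* Details of the last SIGBUS, stored by the handler for later inspection. */
volatile sig_atomic_t sigbus_seen;
volatile sig_atomic_t sigbus_code;
void *volatile sigbus_addr;

/* Map an si_code delivered with SIGBUS to a readable name. */
const char *sigbus_code_name(int code)
{
    switch (code) {
    case BUS_MCEERR_AR: return "BUS_MCEERR_AR"; /* memory error, action required */
    case BUS_MCEERR_AO: return "BUS_MCEERR_AO"; /* memory error, action optional */
    case SI_KERNEL:     return "SI_KERNEL";     /* 0x80 == 128, sent by the kernel */
    default:            return "other";
    }
}

/* Handler: only stores values, so it stays async-signal-safe. */
void on_sigbus(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    sigbus_code = info->si_code;
    sigbus_addr = info->si_addr;
    sigbus_seen = 1;
}

/* Install the handler with SA_SIGINFO so that siginfo_t is delivered.
 * Returns 0 on success, -1 on failure (as sigaction does). */
int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

  After a SIGBUS, sigbus_code_name(sigbus_code) and sigbus_addr can be
  printed from normal program context. Note that simply returning from
  the handler after a real fault re-executes the faulting access, so a
  real test would leave the handler with siglongjmp or terminate
  instead.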

  And here is the output of log messages:

  Apr 3 14:39:23 scaqae09celadm13 kernel: mce: Uncorrected hardware memory error in user-access at 952b200000
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce: Uncorrected hardware memory error in user-access at 94eb300000
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce_notify_irq: 48 callbacks suppressed
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce: [Hardware Error]: Machine check events logged
  Apr 3 14:51:11 scaqae09celadm13 kernel: Memory failure: 0x94eb300: reserved kernel page still referenced by 1 users
  Apr 3 14:51:11 scaqae09celadm13 kernel: Memory failure: 0x94eb300: recovery action for reserved kernel page: Failed
  Apr 3 14:51:11 scaqae09celadm13 kernel: mce: Memory error not recovered
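
  The address in the "user-access" line at 14:51:11 and the number in
  the "Memory failure" lines appear to name the same page: shifting the
  physical address 0x94eb300000 right by the page shift gives the page
  frame number 0x94eb300. A small sketch of that conversion, assuming
  4 KiB base pages (a page shift of 12, as on x86-64); the macro and
  function names are illustrative only:

```c
#include <stdint.h>

/* Assumption: 4 KiB base pages, i.e. a page shift of 12 (x86-64). */
#define ASSUMED_PAGE_SHIFT 12

/* Convert a physical address to its page frame number (PFN). */
uint64_t phys_to_pfn(uint64_t phys_addr)
{
    return phys_addr >> ASSUMED_PAGE_SHIFT;
}
```

  With this, phys_to_pfn(0x94eb300000ULL) equals 0x94eb300, matching the
  value printed after "Memory failure:".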

  The program I use simply does a memcpy and then reads directly from
  the target address:

  for (i = 0; i < DAX_MAPPING_SIZE; i++) // DAX_MAPPING_SIZE is 10G
  {
      total += peek(buf + i);
  }

  char peek(void *addr)
  {
      char temp[128];
      memcpy(temp, addr, 1);  // copy one byte from the mapping
      return *(char *)addr;   // then read the same byte directly
  }

  May I ask whether we missed any steps in triggering the SIGBUS error?

  Target Kernel: 4.19
  Target Release: 18.10

To manage notifications about this bug go to:
https://bugs.launchpad.net/intel/+bug/1774366/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
