Re: Does CONFIG_HARDENED_USERCOPY break /dev/mem?

2017-11-23 Thread Michael Holzheu
On Wed, 22 Nov 2017 09:43:19 -0800,
Kees Cook <keesc...@chromium.org> wrote:

> On Wed, Nov 22, 2017 at 1:28 AM, Michael Holzheu
> <holz...@linux.vnet.ibm.com> wrote:
> > On Mon, 13 Nov 2017 11:19:38 +0100,
> > Michael Holzheu <holz...@linux.vnet.ibm.com> wrote:
> >
> >> On Fri, 10 Nov 2017 10:46:49 -0800,
> >> Kees Cook <keesc...@chromium.org> wrote:
> >>
> >> > On Fri, Nov 10, 2017 at 7:45 AM, Michael Holzheu
> >> > <holz...@linux.vnet.ibm.com> wrote:
> >> > > Hello Kees,
> >> > >
> >> > > When I try to run the crash tool on my s390 live system I get a kernel 
> >> > > panic
> >> > > when reading memory within the kernel image:
> >> > >
> >> > >  # uname -a
> >> > >Linux r3545011 4.14.0-rc8-00066-g1c9dbd4615fd #45 SMP PREEMPT Fri 
> >> > > Nov 10 16:16:22 CET 2017 s390x s390x s390x GNU/Linux
> >> > >  # crash /boot/vmlinux-devel /dev/mem
> >> > >  # crash> rd 0x10
> >> > >
> >> > >  usercopy: kernel memory exposure attempt detected from 
> >> > > 0010 (<kernel text>) (8 bytes)
> >> > >  [ cut here ]
> >> > >  kernel BUG at mm/usercopy.c:72!
> >> > >  illegal operation: 0001 ilc:1 [#1] PREEMPT SMP.
> >> > >  Modules linked in:
> >> > >  CPU: 0 PID: 1461 Comm: crash Not tainted 
> >> > > 4.14.0-rc8-00066-g1c9dbd4615fd-dirty #46
> >> > >  Hardware name: IBM 2827 H66 706 (z/VM 6.3.0)
> >> > >  task: 1ad10100 task.stack: 1df78000
> >> > >  Krnl PSW : 0704d0018000 0038165c 
> >> > > (__check_object_size+0x164/0x1d0)
> >> > > R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 
> >> > > EA:3
> >> > >  Krnl GPRS: 12440e1d 8000 0061 
> >> > > 001cabc0
> >> > > 001cc6d6  00cc4ed2 
> >> > > 1000
> >> > > 03ffc22fdd20 0008 0018 
> >> > > 0001
> >> > > 0008 0010 00381658 
> >> > > 1df7bc90
> >> > >  Krnl Code: 0038164c: c020004a1c4a  larl   %r2,cc4ee0
> >> > >             00381652: c0e5fff2581b  brasl  %r14,1cc688
> >> > >            #00381658: a7f40001      brc    15,38165a
> >> > >            >0038165c: eb42000c000c  srlg   %r4,%r2,12
> >> > >             00381662: eb32001c000c  srlg   %r3,%r2,28
> >> > >             00381668: c0110003      lgfi   %r1,262143
> >> > >             0038166e: ec31ff752065  clgrj  %r3,%r1,2,381558
> >> > >             00381674: a7f4ff67      brc    15,381542
> >> > >  Call Trace:
> >> > >  ([<00381658>] __check_object_size+0x160/0x1d0)
> >> > >   [<0082263a>] read_mem+0xaa/0x130.
> >> > >   [<00386182>] __vfs_read+0x42/0x168.
> >> > >   [<0038632e>] vfs_read+0x86/0x140.
> >> > >   [<00386a26>] SyS_read+0x66/0xc0.
> >> > >   [<00ace6a4>] system_call+0xc4/0x2b0.
> >> > >  INFO: lockdep is turned off.
> >> > >  Last Breaking-Event-Address:
> >> > >   [<00381658>] __check_object_size+0x160/0x1d0
> >> > >
> >> > >  Kernel panic - not syncing: Fatal exception: panic_on_oops
> >> > >
> >> > > With CONFIG_HARDENED_USERCOPY copy_to_user() checks in 
> >> > > __check_object_size()
> >> > > if the source address is within the kernel image:
> >> > >
> >> > >  - __check_object_size() -> check_kernel_text_object():
> >> > >
> >> > >  /* Is this address range in the kernel text area? */
> >> > >  static inline const char *check_kernel_text_object(const void *ptr,
> >> > > unsigned long n)
> >> > >  {
> >> > >  unsigned long textlow = (unsigned long)_stext;
> >> > >  unsigned long texthigh = (unsigned long)_etext;

Re: Does CONFIG_HARDENED_USERCOPY break /dev/mem?

2017-11-22 Thread Michael Holzheu
On Mon, 13 Nov 2017 11:19:38 +0100,
Michael Holzheu <holz...@linux.vnet.ibm.com> wrote:

> On Fri, 10 Nov 2017 10:46:49 -0800,
> Kees Cook <keesc...@chromium.org> wrote:
> 
> > On Fri, Nov 10, 2017 at 7:45 AM, Michael Holzheu
> > <holz...@linux.vnet.ibm.com> wrote:
> > > Hello Kees,
> > >
> > > When I try to run the crash tool on my s390 live system I get a kernel 
> > > panic
> > > when reading memory within the kernel image:
> > >
> > >  # uname -a
> > >Linux r3545011 4.14.0-rc8-00066-g1c9dbd4615fd #45 SMP PREEMPT Fri Nov 
> > > 10 16:16:22 CET 2017 s390x s390x s390x GNU/Linux
> > >  # crash /boot/vmlinux-devel /dev/mem
> > >  # crash> rd 0x10
> > >
> > >  usercopy: kernel memory exposure attempt detected from 0010 
> > > (<kernel text>) (8 bytes)
> > >  [ cut here ]
> > >  kernel BUG at mm/usercopy.c:72!
> > >  illegal operation: 0001 ilc:1 [#1] PREEMPT SMP.
> > >  Modules linked in:
> > >  CPU: 0 PID: 1461 Comm: crash Not tainted 
> > > 4.14.0-rc8-00066-g1c9dbd4615fd-dirty #46
> > >  Hardware name: IBM 2827 H66 706 (z/VM 6.3.0)
> > >  task: 1ad10100 task.stack: 1df78000
> > >  Krnl PSW : 0704d0018000 0038165c 
> > > (__check_object_size+0x164/0x1d0)
> > > R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> > >  Krnl GPRS: 12440e1d 8000 0061 
> > > 001cabc0
> > > 001cc6d6  00cc4ed2 
> > > 1000
> > > 03ffc22fdd20 0008 0018 
> > > 0001
> > > 0008 0010 00381658 
> > > 1df7bc90
> > >  Krnl Code: 0038164c: c020004a1c4a  larl   %r2,cc4ee0
> > >             00381652: c0e5fff2581b  brasl  %r14,1cc688
> > >            #00381658: a7f40001      brc    15,38165a
> > >            >0038165c: eb42000c000c  srlg   %r4,%r2,12
> > >             00381662: eb32001c000c  srlg   %r3,%r2,28
> > >             00381668: c0110003      lgfi   %r1,262143
> > >             0038166e: ec31ff752065  clgrj  %r3,%r1,2,381558
> > >             00381674: a7f4ff67      brc    15,381542
> > >  Call Trace:
> > >  ([<00381658>] __check_object_size+0x160/0x1d0)
> > >   [<0082263a>] read_mem+0xaa/0x130.
> > >   [<00386182>] __vfs_read+0x42/0x168.
> > >   [<0038632e>] vfs_read+0x86/0x140.
> > >   [<00386a26>] SyS_read+0x66/0xc0.
> > >   [<00ace6a4>] system_call+0xc4/0x2b0.
> > >  INFO: lockdep is turned off.
> > >  Last Breaking-Event-Address:
> > >   [<00381658>] __check_object_size+0x160/0x1d0
> > >
> > >  Kernel panic - not syncing: Fatal exception: panic_on_oops
> > >
> > > With CONFIG_HARDENED_USERCOPY copy_to_user() checks in 
> > > __check_object_size()
> > > if the source address is within the kernel image:
> > >
> > >  - __check_object_size() -> check_kernel_text_object():
> > >
> > >  /* Is this address range in the kernel text area? */
> > >  static inline const char *check_kernel_text_object(const void *ptr,
> > > unsigned long n)
> > >  {
> > >  unsigned long textlow = (unsigned long)_stext;
> > >  unsigned long texthigh = (unsigned long)_etext;
> > >  unsigned long textlow_linear, texthigh_linear;
> > >
> > >  if (overlaps(ptr, n, textlow, texthigh))
> > >  return "";
> > >
> > > When the crash tool reads from 0x10, this check leads to the kernel 
> > > BUG()
> > > in drivers/char/mem.c:
> > >
> > >  144 } else {
> > >  145 /*
> > >  146  * On ia64 if a page has been mapped 
> > > somewhere as
> > >  147  * uncached, then it must also be accessed 
> > > uncached
> > >  148  * by the kernel or data corruption may 
> > > occur.
> > >  149  */
> > >  150 ptr = xlate_dev_mem_p

Re: Does CONFIG_HARDENED_USERCOPY break /dev/mem?

2017-11-13 Thread Michael Holzheu
On Fri, 10 Nov 2017 10:46:49 -0800,
Kees Cook <keesc...@chromium.org> wrote:

> On Fri, Nov 10, 2017 at 7:45 AM, Michael Holzheu
> <holz...@linux.vnet.ibm.com> wrote:
> > Hello Kees,
> >
> > When I try to run the crash tool on my s390 live system I get a kernel panic
> > when reading memory within the kernel image:
> >
> >  # uname -a
> >Linux r3545011 4.14.0-rc8-00066-g1c9dbd4615fd #45 SMP PREEMPT Fri Nov 10 
> > 16:16:22 CET 2017 s390x s390x s390x GNU/Linux
> >  # crash /boot/vmlinux-devel /dev/mem
> >  # crash> rd 0x10
> >
> >  usercopy: kernel memory exposure attempt detected from 0010 
> > (<kernel text>) (8 bytes)
> >  [ cut here ]
> >  kernel BUG at mm/usercopy.c:72!
> >  illegal operation: 0001 ilc:1 [#1] PREEMPT SMP.
> >  Modules linked in:
> >  CPU: 0 PID: 1461 Comm: crash Not tainted 
> > 4.14.0-rc8-00066-g1c9dbd4615fd-dirty #46
> >  Hardware name: IBM 2827 H66 706 (z/VM 6.3.0)
> >  task: 1ad10100 task.stack: 1df78000
> >  Krnl PSW : 0704d0018000 0038165c 
> > (__check_object_size+0x164/0x1d0)
> > R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
> >  Krnl GPRS: 12440e1d 8000 0061 
> > 001cabc0
> > 001cc6d6  00cc4ed2 
> > 1000
> > 03ffc22fdd20 0008 0018 
> > 0001
> > 0008 0010 00381658 
> > 1df7bc90
> >  Krnl Code: 0038164c: c020004a1c4a  larl   %r2,cc4ee0
> >             00381652: c0e5fff2581b  brasl  %r14,1cc688
> >            #00381658: a7f40001      brc    15,38165a
> >            >0038165c: eb42000c000c  srlg   %r4,%r2,12
> >             00381662: eb32001c000c  srlg   %r3,%r2,28
> >             00381668: c0110003      lgfi   %r1,262143
> >             0038166e: ec31ff752065  clgrj  %r3,%r1,2,381558
> >             00381674: a7f4ff67      brc    15,381542
> >  Call Trace:
> >  ([<00381658>] __check_object_size+0x160/0x1d0)
> >   [<0082263a>] read_mem+0xaa/0x130.
> >   [<00386182>] __vfs_read+0x42/0x168.
> >   [<0038632e>] vfs_read+0x86/0x140.
> >   [<00386a26>] SyS_read+0x66/0xc0.
> >   [<00ace6a4>] system_call+0xc4/0x2b0.
> >  INFO: lockdep is turned off.
> >  Last Breaking-Event-Address:
> >   [<00381658>] __check_object_size+0x160/0x1d0
> >
> >  Kernel panic - not syncing: Fatal exception: panic_on_oops
> >
> > With CONFIG_HARDENED_USERCOPY copy_to_user() checks in __check_object_size()
> > if the source address is within the kernel image:
> >
> >  - __check_object_size() -> check_kernel_text_object():
> >
> >  /* Is this address range in the kernel text area? */
> >  static inline const char *check_kernel_text_object(const void *ptr,
> > unsigned long n)
> >  {
> >  unsigned long textlow = (unsigned long)_stext;
> >  unsigned long texthigh = (unsigned long)_etext;
> >  unsigned long textlow_linear, texthigh_linear;
> >
> >  if (overlaps(ptr, n, textlow, texthigh))
> >  return "";
> >
> > When the crash tool reads from 0x10, this check leads to the kernel 
> > BUG()
> > in drivers/char/mem.c:
> >
> >  144 } else {
> >  145 /*
> >  146  * On ia64 if a page has been mapped somewhere 
> > as
> >  147  * uncached, then it must also be accessed 
> > uncached
> >  148  * by the kernel or data corruption may occur.
> >  149  */
> >  150 ptr = xlate_dev_mem_ptr(p);
> >  151 if (!ptr)
> >  152 return -EFAULT;
> >  153
> >  154 remaining = copy_to_user(buf, ptr, sz); 
> > <<< BUG
> >  155
> >  156 unxlate_dev_mem_ptr(p, ptr);
> >  157 }
> >
> > Here the reporting function in mm/usercopy.c:
> >
> >  61 static void report_usercopy(const void *ptr, unsigned long len,
> >  62 bool to_us

Does CONFIG_HARDENED_USERCOPY break /dev/mem?

2017-11-10 Thread Michael Holzheu
Hello Kees,

When I try to run the crash tool on my s390 live system I get a kernel panic
when reading memory within the kernel image:

 # uname -a
   Linux r3545011 4.14.0-rc8-00066-g1c9dbd4615fd #45 SMP PREEMPT Fri Nov 10 
16:16:22 CET 2017 s390x s390x s390x GNU/Linux
 # crash /boot/vmlinux-devel /dev/mem
 # crash> rd 0x10

 usercopy: kernel memory exposure attempt detected from 0010 (<kernel text>) (8 bytes)
 [ cut here ]
 kernel BUG at mm/usercopy.c:72! 
 illegal operation: 0001 ilc:1 [#1] PREEMPT SMP.
 Modules linked in:
 CPU: 0 PID: 1461 Comm: crash Not tainted 4.14.0-rc8-00066-g1c9dbd4615fd-dirty 
#46
 Hardware name: IBM 2827 H66 706 (z/VM 6.3.0)
 task: 1ad10100 task.stack: 1df78000 
 Krnl PSW : 0704d0018000 0038165c (__check_object_size+0x164/0x1d0)
R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
 Krnl GPRS: 12440e1d 8000 0061 001cabc0
001cc6d6  00cc4ed2 1000
03ffc22fdd20 0008 0018 0001
0008 0010 00381658 1df7bc90
 Krnl Code: 0038164c: c020004a1c4a  larl   %r2,cc4ee0
            00381652: c0e5fff2581b  brasl  %r14,1cc688
           #00381658: a7f40001      brc    15,38165a
           >0038165c: eb42000c000c  srlg   %r4,%r2,12
            00381662: eb32001c000c  srlg   %r3,%r2,28
            00381668: c0110003      lgfi   %r1,262143
            0038166e: ec31ff752065  clgrj  %r3,%r1,2,381558
            00381674: a7f4ff67      brc    15,381542
 Call Trace:
 ([<00381658>] __check_object_size+0x160/0x1d0)
  [<0082263a>] read_mem+0xaa/0x130.
  [<00386182>] __vfs_read+0x42/0x168.
  [<0038632e>] vfs_read+0x86/0x140.
  [<00386a26>] SyS_read+0x66/0xc0.
  [<00ace6a4>] system_call+0xc4/0x2b0.
 INFO: lockdep is turned off.
 Last Breaking-Event-Address:
  [<00381658>] __check_object_size+0x160/0x1d0

 Kernel panic - not syncing: Fatal exception: panic_on_oops

With CONFIG_HARDENED_USERCOPY copy_to_user() checks in __check_object_size()
if the source address is within the kernel image:

 - __check_object_size() -> check_kernel_text_object():

 /* Is this address range in the kernel text area? */
 static inline const char *check_kernel_text_object(const void *ptr,
unsigned long n)
 {
 unsigned long textlow = (unsigned long)_stext;
 unsigned long texthigh = (unsigned long)_etext;
 unsigned long textlow_linear, texthigh_linear;

 if (overlaps(ptr, n, textlow, texthigh))
 return "";

When the crash tool reads from 0x10, this check leads to the kernel BUG()
in drivers/char/mem.c:

 144 } else {
 145 /*
 146  * On ia64 if a page has been mapped somewhere as
 147  * uncached, then it must also be accessed uncached
 148  * by the kernel or data corruption may occur.
 149  */
 150 ptr = xlate_dev_mem_ptr(p);
 151 if (!ptr)
 152 return -EFAULT;
 153 
 154 remaining = copy_to_user(buf, ptr, sz); <<< BUG
 155 
 156 unxlate_dev_mem_ptr(p, ptr);
 157 }

Here the reporting function in mm/usercopy.c:

 61 static void report_usercopy(const void *ptr, unsigned long len,
 62 bool to_user, const char *type)
 63 {
 64 pr_emerg("kernel memory %s attempt detected %s %p (%s) (%lu 
bytes)\n",
 65 to_user ? "exposure" : "overwrite",
 66 to_user ? "from" : "to", ptr, type ? : "unknown", len);
 67 /*
 68  * For greater effect, it would be nice to do do_group_exit(),
 69  * but BUG() actually hooks all the lock-breaking and per-arch
 70  * Oops code, so that is used here instead.
 71  */
 72 BUG();
 73 }

Shouldn't we skip the kernel address check for /dev/mem - at least when
CONFIG_STRICT_DEVMEM is not enabled?

Michael



Re: [PATCH] s390/crash: Fix KEXEC_NOTE_BYTES definition

2017-06-21 Thread Michael Holzheu
On Fri,  9 Jun 2017 10:17:05 +0800,
Xunlei Pang <xlp...@redhat.com> wrote:

> S390 KEXEC_NOTE_BYTES is not used by note_buf_t as before, which
> is now defined as follows:
> typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
> It was changed by the CONFIG_CRASH_CORE feature.
> 
> This patch gets rid of all the old KEXEC_NOTE_BYTES stuff, and
> renames KEXEC_NOTE_BYTES to CRASH_CORE_NOTE_BYTES for S390.
> 
> Fixes: 692f66f26a4c ("crash: move crashkernel parsing and vmcore related code 
> under CONFIG_CRASH_CORE")
> Cc: Dave Young <dyo...@redhat.com>
> Cc: Dave Anderson <ander...@redhat.com>
> Cc: Hari Bathini <hbath...@linux.vnet.ibm.com>
> Cc: Gustavo Luiz Duarte <gustav...@linux.vnet.ibm.com>
> Signed-off-by: Xunlei Pang <xlp...@redhat.com>

Hello Xunlei,

As you already know on s390 we create the ELF header in the new kernel.
Therefore we don't use the per-cpu buffers for ELF notes to store
the register state.

For RHEL7 we still store the registers in machine_kexec.c:add_elf_notes().
Though we also use the ELF header from the new kernel ...

We assume your original problem with the "kmem -s" failure
was caused by the memory overwrite due to the invalid size of the
"crash_notes" per-cpu buffers.

Therefore your patch looks good for RHEL7 but for upstream we propose the
patch below.
---
[PATCH] s390/crash: Remove unused KEXEC_NOTE_BYTES

After commit 692f66f26a4c19 ("crash: move crashkernel parsing and vmcore
related code under CONFIG_CRASH_CORE") the KEXEC_NOTE_BYTES macro is not
used anymore and for s390 we create the ELF header in the new kernel
anyway. Therefore remove the macro.

Reported-by: Xunlei Pang <xp...@redhat.com>
Reviewed-by: Mikhail Zaslonko <zaslo...@linux.vnet.ibm.com>
Signed-off-by: Michael Holzheu <holz...@linux.vnet.ibm.com>
---
 arch/s390/include/asm/kexec.h | 18 ------------------
 include/linux/crash_core.h    |  5 +++++
 include/linux/kexec.h         |  9 ---------
 3 files changed, 5 insertions(+), 27 deletions(-)

diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h
index 2f924bc30e35..dccf24ee26d3 100644
--- a/arch/s390/include/asm/kexec.h
+++ b/arch/s390/include/asm/kexec.h
@@ -41,24 +41,6 @@
 /* The native architecture */
 #define KEXEC_ARCH KEXEC_ARCH_S390
 
-/*
- * Size for s390x ELF notes per CPU
- *
- * Seven notes plus zero note at the end: prstatus, fpregset, timer,
- * tod_cmp, tod_reg, control regs, and prefix
- */
-#define KEXEC_NOTE_BYTES \
-   (ALIGN(sizeof(struct elf_note), 4) * 8 + \
-ALIGN(sizeof("CORE"), 4) * 7 + \
-ALIGN(sizeof(struct elf_prstatus), 4) + \
-ALIGN(sizeof(elf_fpregset_t), 4) + \
-ALIGN(sizeof(u64), 4) + \
-ALIGN(sizeof(u64), 4) + \
-ALIGN(sizeof(u32), 4) + \
-ALIGN(sizeof(u64) * 16, 4) + \
-ALIGN(sizeof(u32), 4) \
-   )
-
 /* Provide a dummy definition to avoid build failures. */
 static inline void crash_setup_regs(struct pt_regs *newregs,
struct pt_regs *oldregs) { }
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index 541a197ba4a2..4090a42578a8 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -10,6 +10,11 @@
 #define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
 #define CRASH_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
 
+/*
+ * The per-cpu notes area is a list of notes terminated by a "NULL"
+ * note header.  For kdump, the code in vmcore.c runs in the context
+ * of the second kernel to combine them into one note.
+ */
 #define CRASH_CORE_NOTE_BYTES ((CRASH_CORE_NOTE_HEAD_BYTES * 2) +  \
 CRASH_CORE_NOTE_NAME_BYTES +   \
 CRASH_CORE_NOTE_DESC_BYTES)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index c9481ebcbc0c..65888418fb69 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -63,15 +63,6 @@
 #define KEXEC_CORE_NOTE_NAME   CRASH_CORE_NOTE_NAME
 
 /*
- * The per-cpu notes area is a list of notes terminated by a "NULL"
- * note header.  For kdump, the code in vmcore.c runs in the context
- * of the second kernel to combine them into one note.
- */
-#ifndef KEXEC_NOTE_BYTES
-#define KEXEC_NOTE_BYTES   CRASH_CORE_NOTE_BYTES
-#endif
-
-/*
  * This structure is used to hold the arguments that are used when loading
  * kernel binaries.
  */
-- 
2.11.2



Re: [PATCH] s390/crash: Fix KEXEC_NOTE_BYTES definition

2017-06-21 Thread Michael Holzheu
Am Fri,  9 Jun 2017 10:17:05 +0800
schrieb Xunlei Pang :

> S390 KEXEC_NOTE_BYTES is not used by note_buf_t as before, which
> is now defined as follows:
> typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
> It was changed by the CONFIG_CRASH_CORE feature.
> 
> This patch gets rid of all the old KEXEC_NOTE_BYTES stuff, and
> renames KEXEC_NOTE_BYTES to CRASH_CORE_NOTE_BYTES for S390.
> 
> Fixes: 692f66f26a4c ("crash: move crashkernel parsing and vmcore related code 
> under CONFIG_CRASH_CORE")
> Cc: Dave Young 
> Cc: Dave Anderson 
> Cc: Hari Bathini 
> Cc: Gustavo Luiz Duarte 
> Signed-off-by: Xunlei Pang 

Hello Xunlei,

As you already know, on s390 we create the ELF header in the new kernel.
Therefore we don't use the per-cpu buffers for ELF notes to store
the register state.

For RHEL7 we still store the registers in machine_kexec.c:add_elf_notes(),
though we also use the ELF header from the new kernel ...

We assume your original problem with the "kmem -s" failure
was caused by the memory overwrite due to the invalid size of the
"crash_notes" per-cpu buffers.

Therefore your patch looks good for RHEL7 but for upstream we propose the
patch below.
---
[PATCH] s390/crash: Remove unused KEXEC_NOTE_BYTES

After commit 692f66f26a4c19 ("crash: move crashkernel parsing and vmcore
related code under CONFIG_CRASH_CORE") the KEXEC_NOTE_BYTES macro is not
used anymore and for s390 we create the ELF header in the new kernel
anyway. Therefore remove the macro.

Reported-by: Xunlei Pang 
Reviewed-by: Mikhail Zaslonko 
Signed-off-by: Michael Holzheu 
---
 arch/s390/include/asm/kexec.h | 18 --
 include/linux/crash_core.h|  5 +
 include/linux/kexec.h |  9 -
 3 files changed, 5 insertions(+), 27 deletions(-)

diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h
index 2f924bc30e35..dccf24ee26d3 100644
--- a/arch/s390/include/asm/kexec.h
+++ b/arch/s390/include/asm/kexec.h
@@ -41,24 +41,6 @@
 /* The native architecture */
 #define KEXEC_ARCH KEXEC_ARCH_S390
 
-/*
- * Size for s390x ELF notes per CPU
- *
- * Seven notes plus zero note at the end: prstatus, fpregset, timer,
- * tod_cmp, tod_reg, control regs, and prefix
- */
-#define KEXEC_NOTE_BYTES \
-   (ALIGN(sizeof(struct elf_note), 4) * 8 + \
-ALIGN(sizeof("CORE"), 4) * 7 + \
-ALIGN(sizeof(struct elf_prstatus), 4) + \
-ALIGN(sizeof(elf_fpregset_t), 4) + \
-ALIGN(sizeof(u64), 4) + \
-ALIGN(sizeof(u64), 4) + \
-ALIGN(sizeof(u32), 4) + \
-ALIGN(sizeof(u64) * 16, 4) + \
-ALIGN(sizeof(u32), 4) \
-   )
-
 /* Provide a dummy definition to avoid build failures. */
 static inline void crash_setup_regs(struct pt_regs *newregs,
struct pt_regs *oldregs) { }
diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index 541a197ba4a2..4090a42578a8 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -10,6 +10,11 @@
 #define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
 #define CRASH_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
 
+/*
+ * The per-cpu notes area is a list of notes terminated by a "NULL"
+ * note header.  For kdump, the code in vmcore.c runs in the context
+ * of the second kernel to combine them into one note.
+ */
 #define CRASH_CORE_NOTE_BYTES ((CRASH_CORE_NOTE_HEAD_BYTES * 2) +  \
 CRASH_CORE_NOTE_NAME_BYTES +   \
 CRASH_CORE_NOTE_DESC_BYTES)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index c9481ebcbc0c..65888418fb69 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -63,15 +63,6 @@
 #define KEXEC_CORE_NOTE_NAME   CRASH_CORE_NOTE_NAME
 
 /*
- * The per-cpu notes area is a list of notes terminated by a "NULL"
- * note header.  For kdump, the code in vmcore.c runs in the context
- * of the second kernel to combine them into one note.
- */
-#ifndef KEXEC_NOTE_BYTES
-#define KEXEC_NOTE_BYTES   CRASH_CORE_NOTE_BYTES
-#endif
-
-/*
  * This structure is used to hold the arguments that are used when loading
  * kernel binaries.
  */
-- 
2.11.2



Re: [PATCH] s390/crash: Fix KEXEC_NOTE_BYTES definition

2017-06-21 Thread Michael Holzheu
Hi Xunlei,

Sorry for the late reply - I was on vacation up to now.
Give us some time to look into this issue.

Michael

Am Fri,  9 Jun 2017 10:17:05 +0800
schrieb Xunlei Pang :

> S390 KEXEC_NOTE_BYTES is not used by note_buf_t as before, which
> is now defined as follows:
> typedef u32 note_buf_t[CRASH_CORE_NOTE_BYTES/4];
> It was changed by the CONFIG_CRASH_CORE feature.
> 
> This patch gets rid of all the old KEXEC_NOTE_BYTES stuff, and
> renames KEXEC_NOTE_BYTES to CRASH_CORE_NOTE_BYTES for S390.
> 
> Fixes: 692f66f26a4c ("crash: move crashkernel parsing and vmcore related code under CONFIG_CRASH_CORE")
> Cc: Dave Young 
> Cc: Dave Anderson 
> Cc: Hari Bathini 
> Cc: Gustavo Luiz Duarte 
> Signed-off-by: Xunlei Pang 
> ---
>  arch/s390/include/asm/kexec.h |  2 +-
>  include/linux/crash_core.h|  7 +++
>  include/linux/kexec.h | 11 +--
>  3 files changed, 9 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h
> index 2f924bc..352deb8 100644
> --- a/arch/s390/include/asm/kexec.h
> +++ b/arch/s390/include/asm/kexec.h
> @@ -47,7 +47,7 @@
>   * Seven notes plus zero note at the end: prstatus, fpregset, timer,
>   * tod_cmp, tod_reg, control regs, and prefix
>   */
> -#define KEXEC_NOTE_BYTES \
> +#define CRASH_CORE_NOTE_BYTES \
>   (ALIGN(sizeof(struct elf_note), 4) * 8 + \
>ALIGN(sizeof("CORE"), 4) * 7 + \
>ALIGN(sizeof(struct elf_prstatus), 4) + \
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index e9de6b4..dbc6e5c 100644
> --- a/include/linux/crash_core.h
> +++ b/include/linux/crash_core.h
> @@ -10,9 +10,16 @@
>  #define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
>  #define CRASH_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
> 
> +/*
> + * The per-cpu notes area is a list of notes terminated by a "NULL"
> + * note header.  For kdump, the code in vmcore.c runs in the context
> + * of the second kernel to combine them into one note.
> + */
> +#ifndef CRASH_CORE_NOTE_BYTES
>  #define CRASH_CORE_NOTE_BYTES   ((CRASH_CORE_NOTE_HEAD_BYTES * 2) +  \
>CRASH_CORE_NOTE_NAME_BYTES +   \
>CRASH_CORE_NOTE_DESC_BYTES)
> +#endif
> 
>  #define VMCOREINFO_BYTESPAGE_SIZE
>  #define VMCOREINFO_NOTE_NAME"VMCOREINFO"
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 3ea8275..133df03 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -14,7 +14,6 @@
> 
>  #if !defined(__ASSEMBLY__)
> 
> -#include 
>  #include 
> 
>  #include 
> @@ -25,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  /* Verify architecture specific macros are defined */
> 
> @@ -63,15 +63,6 @@
>  #define KEXEC_CORE_NOTE_NAME CRASH_CORE_NOTE_NAME
> 
>  /*
> - * The per-cpu notes area is a list of notes terminated by a "NULL"
> - * note header.  For kdump, the code in vmcore.c runs in the context
> - * of the second kernel to combine them into one note.
> - */
> -#ifndef KEXEC_NOTE_BYTES
> -#define KEXEC_NOTE_BYTES CRASH_CORE_NOTE_BYTES
> -#endif
> -
> -/*
>   * This structure is used to hold the arguments that are used when loading
>   * kernel binaries.
>   */




Re: [PATCH v4 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-04-24 Thread Michael Holzheu
Am Thu, 20 Apr 2017 19:39:32 +0800
schrieb Xunlei Pang <xlp...@redhat.com>:

> As Eric said,
> "what we need to do is move the variable vmcoreinfo_note out
> of the kernel's .bss section.  And modify the code to regenerate
> and keep this information in something like the control page.
> 
> Definitely something like this needs a page all to itself, and ideally
> far away from any other kernel data structures.  I clearly was not
> watching closely the data someone decided to keep this silly thing
> in the kernel's .bss section."
> 
> This patch allocates extra pages for these vmcoreinfo_XXX variables,
> one advantage is that it enhances some safety of vmcoreinfo, because
> vmcoreinfo now is kept far away from other kernel data structures.
> 
> Suggested-by: Eric Biederman <ebied...@xmission.com>
> Cc: Michael Holzheu <holz...@linux.vnet.ibm.com>
> Cc: Juergen Gross <jgr...@suse.com>
> Signed-off-by: Xunlei Pang <xlp...@redhat.com>

Now s390 seems to work with the patches. I tested kdump and
zfcpdump.

Tested-by: Michael Holzheu <holz...@linux.vnet.ibm.com>


> ---
> v3->v4:
> -Rebased on the latest linux-next
> -Handle S390 vmcoreinfo_note properly
> -Handle the newly-added xen/mmu_pv.c
> 
>  arch/ia64/kernel/machine_kexec.c |  5 -
>  arch/s390/kernel/machine_kexec.c |  1 +
>  arch/s390/kernel/setup.c |  6 --
>  arch/x86/kernel/crash.c  |  2 +-
>  arch/x86/xen/mmu_pv.c|  4 ++--
>  include/linux/crash_core.h   |  2 +-
>  kernel/crash_core.c  | 27 +++
>  kernel/ksysfs.c  |  2 +-
>  8 files changed, 29 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/ia64/kernel/machine_kexec.c 
> b/arch/ia64/kernel/machine_kexec.c
> index 599507b..c14815d 100644
> --- a/arch/ia64/kernel/machine_kexec.c
> +++ b/arch/ia64/kernel/machine_kexec.c
> @@ -163,8 +163,3 @@ void arch_crash_save_vmcoreinfo(void)
>  #endif
>  }
> 
> -phys_addr_t paddr_vmcoreinfo_note(void)
> -{
> - return ia64_tpa((unsigned long)(char *)&vmcoreinfo_note);
> -}
> -
> diff --git a/arch/s390/kernel/machine_kexec.c 
> b/arch/s390/kernel/machine_kexec.c
> index 49a6bd4..3d0b14a 100644
> --- a/arch/s390/kernel/machine_kexec.c
> +++ b/arch/s390/kernel/machine_kexec.c
> @@ -246,6 +246,7 @@ void arch_crash_save_vmcoreinfo(void)
>   VMCOREINFO_SYMBOL(lowcore_ptr);
>   VMCOREINFO_SYMBOL(high_memory);
>   VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
> + mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
>  }
> 
>  void machine_shutdown(void)
> diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
> index 3ae756c..3d1d808 100644
> --- a/arch/s390/kernel/setup.c
> +++ b/arch/s390/kernel/setup.c
> @@ -496,11 +496,6 @@ static void __init setup_memory_end(void)
>   pr_notice("The maximum memory size is %luMB\n", memory_end >> 20);
>  }
> 
> -static void __init setup_vmcoreinfo(void)
> -{
> - mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
> -}
> -
>  #ifdef CONFIG_CRASH_DUMP
> 
>  /*
> @@ -939,7 +934,6 @@ void __init setup_arch(char **cmdline_p)
>  #endif
> 
>   setup_resources();
> - setup_vmcoreinfo();
>   setup_lowcore();
>   smp_fill_possible_mask();
>   cpu_detect_mhz_feature();
> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index 22217ec..44404e2 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -457,7 +457,7 @@ static int prepare_elf64_headers(struct crash_elf_data 
> *ced,
>   bufp += sizeof(Elf64_Phdr);
>   phdr->p_type = PT_NOTE;
>   phdr->p_offset = phdr->p_paddr = paddr_vmcoreinfo_note();
> - phdr->p_filesz = phdr->p_memsz = sizeof(vmcoreinfo_note);
> + phdr->p_filesz = phdr->p_memsz = VMCOREINFO_NOTE_SIZE;
>   (ehdr->e_phnum)++;
> 
>  #ifdef CONFIG_X86_64
> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
> index 9d9ae66..35543fa 100644
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -2723,8 +2723,8 @@ void xen_destroy_contiguous_region(phys_addr_t pstart, 
> unsigned int order)
>  phys_addr_t paddr_vmcoreinfo_note(void)
>  {
>   if (xen_pv_domain())
> - return virt_to_machine(&vmcoreinfo_note).maddr;
> + return virt_to_machine(vmcoreinfo_note).maddr;
>   else
> - return __pa_symbol(&vmcoreinfo_note);
> + return __pa(vmcoreinfo_note);
>  }
>  #endif /* CONFIG_KEXEC_CORE */
> diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
> index eb71a70..ba283a2 100644
> --- a/include/linux/crash


Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-23 Thread Michael Holzheu
Am Thu, 23 Mar 2017 17:23:53 +0800
schrieb Xunlei Pang <xp...@redhat.com>:

> On 03/23/2017 at 04:48 AM, Michael Holzheu wrote:
> > Am Wed, 22 Mar 2017 12:30:04 +0800
> > schrieb Dave Young <dyo...@redhat.com>:
> >
> >> On 03/21/17 at 10:18pm, Eric W. Biederman wrote:
> >>> Dave Young <dyo...@redhat.com> writes:
> >>>
> > [snip]
> >
> >>>> I think makedumpfile is using it, but I also vote to remove the
> >>>> CRASHTIME. It is better not to do this while crashing and a makedumpfile
> >>>> userspace patch is needed to drop the use of it.
> >>>>
> >>>>> As we are looking at reliability concerns removing CRASHTIME should make
> >>>>> everything in vmcoreinfo a boot time constant.  Which should simplify
> >>>>> everything considerably.
> >>>> It is a nice improvement..
> >>> We also need to take a close look at what s390 is doing with vmcoreinfo.
> >>> As apparently it is reading it in a different kind of crashdump process.
> >> Yes, need careful review from s390 and maybe ppc64 especially about
> >> patch 2/3, better to have comments from IBM about s390 dump tool and ppc
> >> fadump. Added more cc.
> > On s390 we have at least an issue with patch 1/3. For stand-alone dump
> > and also because we create the ELF header for kdump in the new
> > kernel we save the pointer to the vmcoreinfo note in the old kernel on a
> > defined memory address in our absolute zero lowcore.
> >
> > This is done in arch/s390/kernel/setup.c:
> >
> > static void __init setup_vmcoreinfo(void)
> > {
> > mem_assign_absolute(S390_lowcore.vmcore_info, 
> > paddr_vmcoreinfo_note());
> > }
> >
> > Since with patch 1/3 paddr_vmcoreinfo_note() returns NULL at this point in
> > time we have a problem here.
> >
> > To solve this - I think - we could move the initialization to
> > arch/s390/kernel/machine_kexec.c:
> >
> > void arch_crash_save_vmcoreinfo(void)
> > {
> > VMCOREINFO_SYMBOL(lowcore_ptr);
> > VMCOREINFO_SYMBOL(high_memory);
> > VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
> > mem_assign_absolute(S390_lowcore.vmcore_info, 
> > paddr_vmcoreinfo_note());
> > }
> >
> > Probably related to this is my observation that patch 3/3 leads to
> > an empty VMCOREINFO note for kdump on s390. The note is there ...
> >
> > # readelf -n /var/crash/127.0.0.1-2017-03-22-21:14:39/vmcore | grep VMCORE
> >   VMCOREINFO   0x068e   Unknown note type: (0x)
> >
> > But it contains only zeros.
> 
> Yes, this is a good catch, I will do more tests.

Hello Xunlei,

After spending some time on this, I now understand the problem:

In patch 3/3 you copy vmcoreinfo into the control page before
machine_kexec_prepare() is called. For s390 we give back all the
crashkernel memory to the hypervisor before the new crashkernel
is loaded:

/*
 * Give back memory to hypervisor before new kdump is loaded
 */
static int machine_kexec_prepare_kdump(void)
{
#ifdef CONFIG_CRASH_DUMP
if (MACHINE_IS_VM)
diag10_range(PFN_DOWN(crashk_res.start),
 PFN_DOWN(crashk_res.end - crashk_res.start + 1));
return 0;
#else
return -EINVAL;
#endif
}

So after machine_kexec_prepare_kdump() the contents of your control page
are gone and therefore the vmcoreinfo ELF note contains only zeros.

If you call kimage_crash_copy_vmcoreinfo() after
machine_kexec_prepare_kdump() the problem should be solved for s390.

Regards
Michael




Re: [PATCH v3 1/3] kexec: Move vmcoreinfo out of the kernel's .bss section

2017-03-22 Thread Michael Holzheu
Am Wed, 22 Mar 2017 12:30:04 +0800
schrieb Dave Young :

> On 03/21/17 at 10:18pm, Eric W. Biederman wrote:
> > Dave Young  writes:
> > 

[snip]

> > > I think makedumpfile is using it, but I also vote to remove the
> > > CRASHTIME. It is better not to do this while crashing and a makedumpfile
> > > userspace patch is needed to drop the use of it.
> > >
> > >> 
> > >> As we are looking at reliability concerns removing CRASHTIME should make
> > >> everything in vmcoreinfo a boot time constant.  Which should simplify
> > >> everything considerably.
> > >
> > > It is a nice improvement..
> > 
> > We also need to take a close look at what s390 is doing with vmcoreinfo.
> > As apparently it is reading it in a different kind of crashdump process.
> 
> Yes, need careful review from s390 and maybe ppc64 especially about
> patch 2/3, better to have comments from IBM about s390 dump tool and ppc
> fadump. Added more cc.

On s390 we have at least an issue with patch 1/3. For stand-alone dump
and also because we create the ELF header for kdump in the new
kernel we save the pointer to the vmcoreinfo note in the old kernel on a
defined memory address in our absolute zero lowcore.

This is done in arch/s390/kernel/setup.c:

static void __init setup_vmcoreinfo(void)
{
mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
}

Since with patch 1/3 paddr_vmcoreinfo_note() returns NULL at this point in
time we have a problem here.

To solve this - I think - we could move the initialization to
arch/s390/kernel/machine_kexec.c:

void arch_crash_save_vmcoreinfo(void)
{
VMCOREINFO_SYMBOL(lowcore_ptr);
VMCOREINFO_SYMBOL(high_memory);
VMCOREINFO_LENGTH(lowcore_ptr, NR_CPUS);
mem_assign_absolute(S390_lowcore.vmcore_info, paddr_vmcoreinfo_note());
}

Probably related to this is my observation that patch 3/3 leads to
an empty VMCOREINFO note for kdump on s390. The note is there ...

# readelf -n /var/crash/127.0.0.1-2017-03-22-21:14:39/vmcore | grep VMCORE
  VMCOREINFO   0x068e   Unknown note type: (0x)

But it contains only zeros.

Unfortunately I have not yet understood the reason for this.

Michael




[PATCH] bpf/samples: Fix PT_REGS_IP on s390x and use it

2016-11-23 Thread Michael Holzheu
The files "sampleip_kern.c" and "trace_event_kern.c" directly access
"ctx->regs.ip" which is not available on s390x. Fix this and use the
PT_REGS_IP() macro instead.

Besides that, also fix the macro for s390x to use psw.addr from pt_regs.

Reported-by: Zvonko Kosic <zvonko.ko...@de.ibm.com>
Signed-off-by: Michael Holzheu <holz...@linux.vnet.ibm.com>
---
 samples/bpf/bpf_helpers.h  | 2 +-
 samples/bpf/sampleip_kern.c| 2 +-
 samples/bpf/trace_event_kern.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 90f44bd..dadd516 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -113,7 +113,7 @@ static int (*bpf_skb_under_cgroup)(void *ctx, void *map, int index) =
 #define PT_REGS_FP(x) ((x)->gprs[11]) /* Works only with CONFIG_FRAME_POINTER */
 #define PT_REGS_RC(x) ((x)->gprs[2])
 #define PT_REGS_SP(x) ((x)->gprs[15])
-#define PT_REGS_IP(x) ((x)->ip)
+#define PT_REGS_IP(x) ((x)->psw.addr)
 
 #elif defined(__aarch64__)
 
diff --git a/samples/bpf/sampleip_kern.c b/samples/bpf/sampleip_kern.c
index 774a681..ceabf31 100644
--- a/samples/bpf/sampleip_kern.c
+++ b/samples/bpf/sampleip_kern.c
@@ -25,7 +25,7 @@ int do_sample(struct bpf_perf_event_data *ctx)
u64 ip;
u32 *value, init_val = 1;
 
-   ip = ctx->regs.ip;
+   ip = PT_REGS_IP(&ctx->regs);
value = bpf_map_lookup_elem(&ip_map, &ip);
if (value)
*value += 1;
diff --git a/samples/bpf/trace_event_kern.c b/samples/bpf/trace_event_kern.c
index 71a8ed3..41b6115 100644
--- a/samples/bpf/trace_event_kern.c
+++ b/samples/bpf/trace_event_kern.c
@@ -50,7 +50,7 @@ int bpf_prog1(struct bpf_perf_event_data *ctx)
key.userstack = bpf_get_stackid(ctx, &stackmap, USER_STACKID_FLAGS);
if ((int)key.kernstack < 0 && (int)key.userstack < 0) {
bpf_trace_printk(fmt, sizeof(fmt), cpu, ctx->sample_period,
-ctx->regs.ip);
+PT_REGS_IP(&ctx->regs));
return 0;
}
 
-- 
2.8.4



[PATCH] bpf/samples: Fix PT_REGS_IP on s390x and use it

2016-11-23 Thread Michael Holzheu
The files "sampleip_kern.c" and "trace_event_kern.c" directly access
"ctx->regs.ip" which is not available on s390x. Fix this and use the
PT_REGS_IP() macro instead.

Besides of that also fix the macro for s390x and use psw.addr from pt_regs.

Reported-by: Zvonko Kosic 
Signed-off-by: Michael Holzheu 
---
 samples/bpf/bpf_helpers.h  | 2 +-
 samples/bpf/sampleip_kern.c| 2 +-
 samples/bpf/trace_event_kern.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/bpf_helpers.h b/samples/bpf/bpf_helpers.h
index 90f44bd..dadd516 100644
--- a/samples/bpf/bpf_helpers.h
+++ b/samples/bpf/bpf_helpers.h
@@ -113,7 +113,7 @@ static int (*bpf_skb_under_cgroup)(void *ctx, void *map, 
int index) =
 #define PT_REGS_FP(x) ((x)->gprs[11]) /* Works only with CONFIG_FRAME_POINTER 
*/
 #define PT_REGS_RC(x) ((x)->gprs[2])
 #define PT_REGS_SP(x) ((x)->gprs[15])
-#define PT_REGS_IP(x) ((x)->ip)
+#define PT_REGS_IP(x) ((x)->psw.addr)
 
 #elif defined(__aarch64__)
 
diff --git a/samples/bpf/sampleip_kern.c b/samples/bpf/sampleip_kern.c
index 774a681..ceabf31 100644
--- a/samples/bpf/sampleip_kern.c
+++ b/samples/bpf/sampleip_kern.c
@@ -25,7 +25,7 @@ int do_sample(struct bpf_perf_event_data *ctx)
u64 ip;
u32 *value, init_val = 1;
 
-   ip = ctx->regs.ip;
+   ip = PT_REGS_IP(&ctx->regs);
value = bpf_map_lookup_elem(&ip_map, &ip);
if (value)
*value += 1;
diff --git a/samples/bpf/trace_event_kern.c b/samples/bpf/trace_event_kern.c
index 71a8ed3..41b6115 100644
--- a/samples/bpf/trace_event_kern.c
+++ b/samples/bpf/trace_event_kern.c
@@ -50,7 +50,7 @@ int bpf_prog1(struct bpf_perf_event_data *ctx)
key.userstack = bpf_get_stackid(ctx, , USER_STACKID_FLAGS);
if ((int)key.kernstack < 0 && (int)key.userstack < 0) {
bpf_trace_printk(fmt, sizeof(fmt), cpu, ctx->sample_period,
-ctx->regs.ip);
+PT_REGS_IP(&ctx->regs));
return 0;
}
 
-- 
2.8.4



Re: [PATCH] s390/hypfs: Use kmalloc_array() in diag0c_store()

2016-09-01 Thread Michael Holzheu
Am Thu, 1 Sep 2016 17:39:02 +0200
schrieb Paolo Bonzini :

> 
> 
> On 01/09/2016 12:32, Heiko Carstens wrote:
> > On Thu, Sep 01, 2016 at 11:38:15AM +0200, SF Markus Elfring wrote:
> >> From: Markus Elfring 
> >> Date: Thu, 1 Sep 2016 11:30:58 +0200
> >>
> >> A multiplication for the size determination of a memory allocation
> >> indicated that an array data structure should be processed.
> >> Thus use the corresponding function "kmalloc_array".
> >>
> >> This issue was detected by using the Coccinelle software.
> >>
> >> Signed-off-by: Markus Elfring 
> >> ---
> >>  arch/s390/hypfs/hypfs_diag0c.c | 4 +++-
> >>  1 file changed, 3 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/s390/hypfs/hypfs_diag0c.c
> >> b/arch/s390/hypfs/hypfs_diag0c.c index 0f1927c..61418a8 100644
> >> --- a/arch/s390/hypfs/hypfs_diag0c.c
> >> +++ b/arch/s390/hypfs/hypfs_diag0c.c
> >> @@ -48,7 +48,9 @@ static void *diag0c_store(unsigned int *count)
> >>  
> >>get_online_cpus();
> >>cpu_count = num_online_cpus();
> >> -  cpu_vec = kmalloc(sizeof(*cpu_vec) * num_possible_cpus(),
> >> GFP_KERNEL);
> >> +  cpu_vec = kmalloc_array(num_possible_cpus(),
> >> +  sizeof(*cpu_vec),
> >> +  GFP_KERNEL);
> > 
> > How does this improve the situation? For any real life scenario
> > this can't overflow, but it does add an extra (pointless) runtime
> > check, since num_possible_cpus() is not a compile time constant.
> > 
> > So, why is this an "issue"?
> 
> It's not an issue but I for one still prefer consistent use of
> kmalloc_array and kcalloc.

Hello Paolo,

I will keep this in mind for future code, but would prefer not changing
this now.

Michael




Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning

2016-08-19 Thread Michael Holzheu
Am Thu, 18 Aug 2016 10:42:08 -0400
schrieb Tejun Heo <t...@kernel.org>:

> Hello, Michael.
> 
> On Thu, Aug 18, 2016 at 11:30:51AM +0200, Michael Holzheu wrote:
> > Well, "no requirement" is not 100% correct. Currently we use
> > the CPU topology information to assign newly coming CPUs to the
> > "best fitting" node.
> > 
> > Example:
> > 
> > 1) We have two fake NUMA nodes N1 and N2 with the following CPU
> >assignment:
> > 
> >- N1: cpu 1 on chip 1
> >- N2: cpu 2 on chip 2
> > 
> > 2) A new cpu 3 is configured that lives on chip 2
> > 3) We assign cpu 3 to N2
> > 
> > We do this only if the nodes are balanced. If N2 had already one
> > more cpu than N1 we would assign the new cpu to N1.
> 
> I see.  Out of curiosity, what's the purpose of fakenuma on s390?
> There don't seem to be any actual memory locality concerns.  Is it
> just to segment memory of a machine into multiple pieces?

Correct.

> If so, why
> is that necessary, do you hit some scalability issues w/o NUMA nodes?

Yes, we hit a scalability issue. Our performance team found out that for
big (> 1 TB) overcommitted (memory / swap ratio > 1 : 2) systems we
see problems:

 - Zone locks are highly contended because ZONE_NORMAL is big:
   * zone->lock
   * zone->lru_lock
 - One kswapd is not enough for swapping

We hope that those problems are resolved by fake NUMA because for each
node a separate memory subsystem is created with separate zone locks
and kswapd threads.

> As for the solution, if blind RR isn't good enough, although it sounds
> like it could given that the balancing wasn't all that strong to begin
> with, would it be an option to implement an interface which just
> requests a new CPU rather than a specific one and then pick one of the
> vacant possible CPUs considering node balancing?

IMHO this is a promising idea. To say it in my words:

 - At boot time we already pin all remaining "not configured" logical
   CPUs to nodes. So all possible cpus are pinned to nodes and
   cpu_to_node() will work.

 - If a new physical cpu gets configured, we get the CPU topology
   information from the system and find the best node.

 - We get a logical cpu number from the node pool and assign the
   new physical cpu to that number.

If that works we would be as good as before. We will have a look into
the code if it is possible.

Michael




Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning

2016-08-18 Thread Michael Holzheu
Am Wed, 17 Aug 2016 09:58:55 -0400
schrieb Tejun Heo :

> Hello, Heiko.
> 
> On Wed, Aug 17, 2016 at 12:19:53AM +0200, Heiko Carstens wrote:
> > I think the easiest solution would be to simply assign all cpus,
> > for which we do not have any topology information, to an arbitrary
> > node; e.g. round robin.
> > 
> > After all the case that cpus are added later is rare and the s390
> > fake numa implementation does not know about the memory topology.
> > All it is doing is
> 
> Ah, okay, so there really is no requirement for a newly coming up cpu
> to be on a specific node.

Well, "no requirement" is not 100% correct. Currently we use the
CPU topology information to assign newly coming CPUs to the "best
fitting" node.

Example:

1) We have two fake NUMA nodes N1 and N2 with the following CPU
   assignment:

   - N1: cpu 1 on chip 1
   - N2: cpu 2 on chip 2

2) A new cpu 3 is configured that lives on chip 2
3) We assign cpu 3 to N2

We do this only if the nodes are balanced. If N2 had already one more
cpu than N1 we would assign the new cpu to N1.

Michael




Re: [bisected] "sched: Allow per-cpu kernel threads to run on online && !active" causes warning

2016-08-17 Thread Michael Holzheu
Am Wed, 17 Aug 2016 00:19:53 +0200
schrieb Heiko Carstens :

> On Tue, Aug 16, 2016 at 11:42:05AM -0400, Tejun Heo wrote:
> > Hello, Peter.
> > 
> > On Tue, Aug 16, 2016 at 05:29:49PM +0200, Peter Zijlstra wrote:
> > > On Tue, Aug 16, 2016 at 11:20:27AM -0400, Tejun Heo wrote:
> > > > As long as the mapping doesn't change after the first onlining
> > > > of the CPU, the workqueue side shouldn't be too difficult to
> > > > fix up.  I'll look into it.  For memory allocations, as long as
> > > > the cpu <-> node mapping is established before any memory
> > > > allocation for the cpu takes place, it should be fine too, I
> > > > think.
> > > 
> > > Don't we allocate per-cpu memory for 'cpu_possible_map' on boot?
> > > There's a whole bunch of per-cpu memory users that does things
> > > like:
> > > 
> > > 
> > >   for_each_possible_cpu(cpu) {
> > >   struct foo *foo = per_cpu_ptr(&per_cpu_var, cpu);
> > > 
> > >   /* muck with foo */
> > >   }
> > > 
> > > 
> > > Which requires a cpu->node map for all possible cpus at boot time.
> > 
> > Ah, right.  If cpu -> node mapping is dynamic, there isn't much that
> > we can do about allocating per-cpu memory on the wrong node.  And it
> > is problematic that percpu allocations can race against an onlining
> > CPU switching its node association.
> > 
> > One way to keep the mapping stable would be reserving per-node
> > possible CPU slots so that the CPU number assigned to a new CPU is
> > on the right node.  It'd be a simple solution but would get really
> > expensive with increasing number of nodes.
> > 
> > Heiko, do you have any ideas?
> 
> I think the easiest solution would be to simply assign all cpus, for
> which we do not have any topology information, to an arbitrary node;
> e.g. round robin.
> 
> After all the case that cpus are added later is rare and the s390
> fake numa implementation does not know about the memory topology. All
> it is doing is distributing the memory to several nodes in order to
> avoid a single huge node. So that should be sort of ok.
> 
> Unless somebody has a better idea?
> 
> Michael, Martin?

If it is really required that cpu_to_node() can be called for
all possible cpus this sounds like a reasonable workaround to me.

Michael




Re: [PATCH v2] s390/kexec: Consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(unprotect)_crashkres()

2016-04-05 Thread Michael Holzheu
Hello Xunlei,

On Tue,  5 Apr 2016 15:09:59 +0800
Xunlei Pang <xlp...@redhat.com> wrote:
> Commit 3f625002581b ("kexec: introduce a protection mechanism
> for the crashkernel reserved memory") is a similar mechanism
> for protecting the crash kernel reserved memory to previous
> crash_map/unmap_reserved_pages() implementation, the new one
> is more generic in name and cleaner in code (besides, some
> arch may not be allowed to unmap the pgtable).
> 
> Therefore, this patch consolidates them, and uses the new
> arch_kexec_protect(unprotect)_crashkres() to replace former
> crash_map/unmap_reserved_pages() which by now has been only
> used by S390.
> 
> The consolidation work needs the crash memory to be mapped
> initially, so get rid of S390 crash kernel memblock removal
> in reserve_crashkernel().

If you fix this comment, I am fine with your patch.

Acked-by: Michael Holzheu <holz...@linux.vnet.ibm.com>




Re: [PATCH] s390/kexec: Consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(unprotect)_crashkres()

2016-04-04 Thread Michael Holzheu
Hello Xunlei,

On Sat, 2 Apr 2016 09:23:50 +0800
Xunlei Pang <xp...@redhat.com> wrote:

> On 2016/04/02 at 01:41, Michael Holzheu wrote:
> > Hello Xunlei again,
> >
> > Some initial comments below...
> >
> > On Wed, 30 Mar 2016 19:47:21 +0800
> > Xunlei Pang <xlp...@redhat.com> wrote:
> >

[snip]

> >> +  os_info_crashkernel_add(0, 0);
> >> +  }
> >> +}
> > Please do not move these functions in the file. If you leave it at their
> > old location, the patch will be *much* smaller.
> 
> In fact, I did this to avoid adding an extra declaration.

IMHO no extra declaration is necessary (see patch below).

[snip]

> >> +/*
> >>   * PM notifier callback for kdump
> >>   */
> >>  static int machine_kdump_pm_cb(struct notifier_block *nb, unsigned long 
> >> action,
> >> @@ -43,12 +89,12 @@ static int machine_kdump_pm_cb(struct notifier_block 
> >> *nb, unsigned long action,
> >>switch (action) {
> >>case PM_SUSPEND_PREPARE:
> >>case PM_HIBERNATION_PREPARE:
> >> -  if (crashk_res.start)
> >> +  if (kexec_crash_image)
> > Why this change?
> 
> arch_kexec_protect_crashkres() will do the unmapping once the kdump kernel
> is loaded (i.e. kexec_crash_image is non-NULL), so we should check
> "kexec_crash_image" here and do the corresponding re-mapping.
> 
> A NULL kexec_crash_image means that the kdump kernel is not loaded; in this
> case the mapping is already set up, either initially in reserve_crashkernel()
> or by arch_kexec_unprotect_crashkres().

Sorry, I missed this obvious part. So your change seems to be ok here.

There is another problem with your patch: The vmem_remove_mapping() function
can only remove mappings which have been added by vmem_add_mapping() before.
Therefore currently the vmem_remove_mapping() call is a nop and we still have
a RW mapping after the kexec system call.

If you check "/sys/kernel/debug/kernel_page_tables" you will see that the
crashkernel memory is still mapped RW after loading kdump.

To fix this we could keep the memblock_remove() and call vmem_add_mapping()
from a init function (see patch below).

[snip]

> >> --- a/arch/s390/kernel/setup.c
> >> +++ b/arch/s390/kernel/setup.c
> >> @@ -603,7 +603,7 @@ static void __init reserve_crashkernel(void)
> >>crashk_res.start = crash_base;
> >>crashk_res.end = crash_base + crash_size - 1;
> >>insert_resource(&iomem_resource, &crashk_res);
> >> -  memblock_remove(crash_base, crash_size);
> >> +  memblock_reserve(crash_base, crash_size);
> > I will discuss this next week in our team.
> 
> This can address the bad page warning when shrinking crashk_res.

Yes, shrinking crashkernel memory is currently broken on s390. Heiko Carstens
(on cc) plans to do some general rework that probably will automatically fix
this issue.

So I would suggest that you merge the following with your initial patch and
then resend the merged patch.

What do you think?

Michael
---8<
s390/kdump: Consolidate crash_map/unmap_reserved_pages() - update

 - Move functions back to keep the patch small
 - Consolidate crash_map_reserved_pages/arch_kexec_unprotect_crashkres()
 - Re-introduce memblock_remove()
 - Call arch_kexec_unprotect_crashkres() in machine_kdump_pm_init()

Signed-off-by: Michael Holzheu <holz...@linux.vnet.ibm.com>
---
 arch/s390/kernel/machine_kexec.c |   88 +--
 1 file changed, 40 insertions(+), 48 deletions(-)

--- a/arch/s390/kernel/machine_kexec.c
+++ b/arch/s390/kernel/machine_kexec.c
@@ -35,52 +35,6 @@ extern const unsigned long long relocate_kernel_len;
 #ifdef CONFIG_CRASH_DUMP
 
 /*
- * Map or unmap crashkernel memory
- */
-static void crash_map_pages(int enable)
-{
-   unsigned long size = resource_size(&crashk_res);
-
-   BUG_ON(crashk_res.start % KEXEC_CRASH_MEM_ALIGN ||
-  size % KEXEC_CRASH_MEM_ALIGN);
-   if (enable)
-   vmem_add_mapping(crashk_res.start, size);
-   else {
-   vmem_remove_mapping(crashk_res.start, size);
-   if (size)
-   os_info_crashkernel_add(crashk_res.start, size);
-   else
-   os_info_crashkernel_add(0, 0);
-   }
-}
-
-/*
- * Map crashkernel memory
- */
-static void crash_map_reserved_pages(void)
-{
-   crash_map_pages(1);
-}
-
-/*
- * Unmap crashkernel memory
- */
-static void crash_unmap_reserved_pages(void)
-{
-   crash_map_pages(0);
-}
-
-void arch_kexec_protect_crashkres(void)
-{
-   crash_unmap_reserved_pages();
-}
-
-void arch_kexec_unprotect_crashkres(void)
-{
-   crash_map_reserved_pages();
-}
-
-/*
  * PM notifier callback for kdump


Re: [PATCH] s390/kexec: Consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(unprotect)_crashkres()

2016-04-01 Thread Michael Holzheu
Hello Xunlei again,

Some initial comments below...

On Wed, 30 Mar 2016 19:47:21 +0800
Xunlei Pang <xlp...@redhat.com> wrote:

> Commit 3f625002581b ("kexec: introduce a protection mechanism
> for the crashkernel reserved memory") is a similar mechanism
> for protecting the crash kernel reserved memory to previous
> crash_map/unmap_reserved_pages() implementation, the new one
> is more generic in name and cleaner in code (besides, some
> arch may not be allowed to unmap the pgtable).
> 
> Therefore, this patch consolidates them, and uses the new
> arch_kexec_protect(unprotect)_crashkres() to replace former
> crash_map/unmap_reserved_pages() which by now has been only
> used by S390.
> 
> The consolidation work needs the crash memory to be mapped
> initially, so get rid of S390 crash kernel memblock removal
> in reserve_crashkernel(). Once kdump kernel is loaded, the
> new arch_kexec_protect_crashkres() implemented for S390 will
> actually unmap the pgtable like before.
> 
> The patch also fixed a S390 crash_shrink_memory() bad page warning
> in passing due to not using memblock_reserve():
>   BUG: Bad page state in process bash  pfn:7e400
>   page:03d101f9 count:0 mapcount:1 mapping: (null) index:0x0
>   flags: 0x0()
>   page dumped because: nonzero mapcount
>   Modules linked in: ghash_s390 prng aes_s390 des_s390 des_generic
>   CPU: 0 PID: 1558 Comm: bash Not tainted 4.6.0-rc1-next-20160327 #1
>73007a58 73007ae8 0002 
>73007b88 73007b00 73007b00 0022cf4e
>00a579b8 007b0dd6 00791a8c
>000b
>73007b48 73007ae8  
>070003d10001 00112f20 73007ae8 73007b48
>   Call Trace:
>   ([<00112e0c>] show_trace+0x5c/0x78)
>   ([<00112ed4>] show_stack+0x6c/0xe8)
>   ([<003f28dc>] dump_stack+0x84/0xb8)
>   ([<00235454>] bad_page+0xec/0x158)
>   ([<002357a4>] free_pages_prepare+0x2e4/0x308)
>   ([<002383a2>] free_hot_cold_page+0x42/0x198)
>   ([<001c45e0>] crash_free_reserved_phys_range+0x60/0x88)
>   ([<001c49b0>] crash_shrink_memory+0xb8/0x1a0)
>   ([<0015bcae>] kexec_crash_size_store+0x46/0x60)
>   ([<0033d326>] kernfs_fop_write+0x136/0x180)
>   ([<002b253c>] __vfs_write+0x3c/0x100)
>   ([<002b35ce>] vfs_write+0x8e/0x190)
>   ([<002b4ca0>] SyS_write+0x60/0xd0)
>   ([<0063067c>] system_call+0x244/0x264)
> 
> Cc: Michael Holzheu <holz...@linux.vnet.ibm.com>
> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
> ---
> Tested kexec/kdump on S390x
> 
>  arch/s390/kernel/machine_kexec.c | 86 ++--
>  arch/s390/kernel/setup.c |  7 ++--
>  include/linux/kexec.h|  2 -
>  kernel/kexec.c   | 12 --
>  kernel/kexec_core.c  | 11 +
>  5 files changed, 54 insertions(+), 64 deletions(-)
> 
> diff --git a/arch/s390/kernel/machine_kexec.c 
> b/arch/s390/kernel/machine_kexec.c
> index 2f1b721..1ec6cfc 100644
> --- a/arch/s390/kernel/machine_kexec.c
> +++ b/arch/s390/kernel/machine_kexec.c
> @@ -35,6 +35,52 @@ extern const unsigned long long relocate_kernel_len;
>  #ifdef CONFIG_CRASH_DUMP
> 
>  /*
> + * Map or unmap crashkernel memory
> + */
> +static void crash_map_pages(int enable)
> +{
> + unsigned long size = resource_size(&crashk_res);
> +
> + BUG_ON(crashk_res.start % KEXEC_CRASH_MEM_ALIGN ||
> +size % KEXEC_CRASH_MEM_ALIGN);
> + if (enable)
> + vmem_add_mapping(crashk_res.start, size);
> + else {
> + vmem_remove_mapping(crashk_res.start, size);
> + if (size)
> + os_info_crashkernel_add(crashk_res.start, size);
> + else
> + os_info_crashkernel_add(0, 0);
> + }
> +}

Please do not move these functions in the file. If you leave it at their
old location, the patch will be *much* smaller.

> +
> +/*
> + * Map crashkernel memory
> + */
> +static void crash_map_reserved_pages(void)
> +{
> + crash_map_pages(1);
> +}
> +
> +/*
> + * Unmap crashkernel memory
> + */
> +static void crash_unmap_reserved_pages(void)
> +{
> + crash_map_pages(0);
> +}
> +
> +void arch_kexec_protect_crashkres(void)
> +{
> + crash_unmap_reserved_pages();
> +}
> +
> +void arch_kexec_unprotect_crashkres(void)
> +{
> + crash_map_reserved_pages();

Re: [PATCH] s390/kexec: Consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(unprotect)_crashkres()

2016-04-01 Thread Michael Holzheu
Hello Xunlei again,

Some initial comments below...

On Wed, 30 Mar 2016 19:47:21 +0800
Xunlei Pang  wrote:

> Commit 3f625002581b ("kexec: introduce a protection mechanism
> for the crashkernel reserved memory") is a similar mechanism
> for protecting the crash kernel reserved memory to previous
> crash_map/unmap_reserved_pages() implementation, the new one
> is more generic in name and cleaner in code (besides, some
> arch may not be allowed to unmap the pgtable).
> 
> Therefore, this patch consolidates them, and uses the new
> arch_kexec_protect(unprotect)_crashkres() to replace former
> crash_map/unmap_reserved_pages() which by now has been only
> used by S390.
> 
> The consolidation work needs the crash memory to be mapped
> initially, so get rid of S390 crash kernel memblock removal
> in reserve_crashkernel(). Once kdump kernel is loaded, the
> new arch_kexec_protect_crashkres() implemented for S390 will
> actually unmap the pgtable like before.
> 
> The patch also fixed a S390 crash_shrink_memory() bad page warning
> in passing due to not using memblock_reserve():
>   BUG: Bad page state in process bash  pfn:7e400
>   page:03d101f9 count:0 mapcount:1 mapping: (null) index:0x0
>   flags: 0x0()
>   page dumped because: nonzero mapcount
>   Modules linked in: ghash_s390 prng aes_s390 des_s390 des_generic
>   CPU: 0 PID: 1558 Comm: bash Not tainted 4.6.0-rc1-next-20160327 #1
>73007a58 73007ae8 0002 
>73007b88 73007b00 73007b00 0022cf4e
>00a579b8 007b0dd6 00791a8c
>000b
>73007b48 73007ae8  
>070003d10001 00112f20 73007ae8 73007b48
>   Call Trace:
>   ([<00112e0c>] show_trace+0x5c/0x78)
>   ([<00112ed4>] show_stack+0x6c/0xe8)
>   ([<003f28dc>] dump_stack+0x84/0xb8)
>   ([<00235454>] bad_page+0xec/0x158)
>   ([<002357a4>] free_pages_prepare+0x2e4/0x308)
>   ([<002383a2>] free_hot_cold_page+0x42/0x198)
>   ([<001c45e0>] crash_free_reserved_phys_range+0x60/0x88)
>   ([<001c49b0>] crash_shrink_memory+0xb8/0x1a0)
>   ([<0015bcae>] kexec_crash_size_store+0x46/0x60)
>   ([<0033d326>] kernfs_fop_write+0x136/0x180)
>   ([<0000002b253c>] __vfs_write+0x3c/0x100)
>   ([<002b35ce>] vfs_write+0x8e/0x190)
>   ([<002b4ca0>] SyS_write+0x60/0xd0)
>   ([<0063067c>] system_call+0x244/0x264)
> 
> Cc: Michael Holzheu 
> Signed-off-by: Xunlei Pang 
> ---
> Tested kexec/kdump on S390x
> 
>  arch/s390/kernel/machine_kexec.c | 86 
> ++--
>  arch/s390/kernel/setup.c |  7 ++--
>  include/linux/kexec.h|  2 -
>  kernel/kexec.c   | 12 --
>  kernel/kexec_core.c  | 11 +
>  5 files changed, 54 insertions(+), 64 deletions(-)
> 
> diff --git a/arch/s390/kernel/machine_kexec.c 
> b/arch/s390/kernel/machine_kexec.c
> index 2f1b721..1ec6cfc 100644
> --- a/arch/s390/kernel/machine_kexec.c
> +++ b/arch/s390/kernel/machine_kexec.c
> @@ -35,6 +35,52 @@ extern const unsigned long long relocate_kernel_len;
>  #ifdef CONFIG_CRASH_DUMP
> 
>  /*
> + * Map or unmap crashkernel memory
> + */
> +static void crash_map_pages(int enable)
> +{
> + unsigned long size = resource_size(&crashk_res);
> +
> + BUG_ON(crashk_res.start % KEXEC_CRASH_MEM_ALIGN ||
> +size % KEXEC_CRASH_MEM_ALIGN);
> + if (enable)
> + vmem_add_mapping(crashk_res.start, size);
> + else {
> + vmem_remove_mapping(crashk_res.start, size);
> + if (size)
> + os_info_crashkernel_add(crashk_res.start, size);
> + else
> + os_info_crashkernel_add(0, 0);
> + }
> +}

Please do not move these functions in the file. If you leave them at their
old location, the patch will be *much* smaller.

> +
> +/*
> + * Map crashkernel memory
> + */
> +static void crash_map_reserved_pages(void)
> +{
> + crash_map_pages(1);
> +}
> +
> +/*
> + * Unmap crashkernel memory
> + */
> +static void crash_unmap_reserved_pages(void)
> +{
> + crash_map_pages(0);
> +}
> +
> +void arch_kexec_protect_crashkres(void)
> +{
> + crash_unmap_reserved_pages();
> +}
> +
> +void arch_kexec_unprotect_crashkres(void)
> +{
> + crash_map_reserved_pages();
> +}

Please replace the crash_(un)map_reserved_pages() functions
with the new arch_kexec_(un)protect_crashkres() functions.

Re: [PATCH] s390/kexec: Consolidate crash_map/unmap_reserved_pages() and arch_kexec_protect(unprotect)_crashkres()

2016-04-01 Thread Michael Holzheu
Hello Xunlei,

This patch has the potential to create some funny side effects.
Especially the change from memblock_remove() to memblock_reserve()
and the later call of reserve_crashkernel().

Give me some time. I will look into this next week.

Michael

On Wed, 30 Mar 2016 19:47:21 +0800
Xunlei Pang <xlp...@redhat.com> wrote:

> Commit 3f625002581b ("kexec: introduce a protection mechanism
> for the crashkernel reserved memory") is a similar mechanism
> for protecting the crash kernel reserved memory to previous
> crash_map/unmap_reserved_pages() implementation, the new one
> is more generic in name and cleaner in code (besides, some
> arch may not be allowed to unmap the pgtable).
> 
> Therefore, this patch consolidates them, and uses the new
> arch_kexec_protect(unprotect)_crashkres() to replace former
> crash_map/unmap_reserved_pages() which by now has been only
> used by S390.
> 
> The consolidation work needs the crash memory to be mapped
> initially, so get rid of S390 crash kernel memblock removal
> in reserve_crashkernel(). Once kdump kernel is loaded, the
> new arch_kexec_protect_crashkres() implemented for S390 will
> actually unmap the pgtable like before.
> 
> The patch also fixed a S390 crash_shrink_memory() bad page warning
> in passing due to not using memblock_reserve():
>   BUG: Bad page state in process bash  pfn:7e400
>   page:03d101f9 count:0 mapcount:1 mapping: (null) index:0x0
>   flags: 0x0()
>   page dumped because: nonzero mapcount
>   Modules linked in: ghash_s390 prng aes_s390 des_s390 des_generic
>   CPU: 0 PID: 1558 Comm: bash Not tainted 4.6.0-rc1-next-20160327 #1
>73007a58 73007ae8 0002 
>73007b88 73007b00 73007b00 0022cf4e
>00a579b8 007b0dd6 00791a8c
>000b
>73007b48 73007ae8  
>070003d10001 00112f20 73007ae8 73007b48
>   Call Trace:
>   ([<00112e0c>] show_trace+0x5c/0x78)
>   ([<00112ed4>] show_stack+0x6c/0xe8)
>   ([<003f28dc>] dump_stack+0x84/0xb8)
>   ([<00235454>] bad_page+0xec/0x158)
>   ([<002357a4>] free_pages_prepare+0x2e4/0x308)
>   ([<002383a2>] free_hot_cold_page+0x42/0x198)
>   ([<001c45e0>] crash_free_reserved_phys_range+0x60/0x88)
>   ([<001c49b0>] crash_shrink_memory+0xb8/0x1a0)
>   ([<0015bcae>] kexec_crash_size_store+0x46/0x60)
>   ([<0033d326>] kernfs_fop_write+0x136/0x180)
>   ([<002b253c>] __vfs_write+0x3c/0x100)
>   ([<002b35ce>] vfs_write+0x8e/0x190)
>   ([<002b4ca0>] SyS_write+0x60/0xd0)
>   ([<0063067c>] system_call+0x244/0x264)
> 
> Cc: Michael Holzheu <holz...@linux.vnet.ibm.com>
> Signed-off-by: Xunlei Pang <xlp...@redhat.com>
> ---
> Tested kexec/kdump on S390x
> 
>  arch/s390/kernel/machine_kexec.c | 86 
> ++--
>  arch/s390/kernel/setup.c |  7 ++--
>  include/linux/kexec.h|  2 -
>  kernel/kexec.c   | 12 --
>  kernel/kexec_core.c  | 11 +
>  5 files changed, 54 insertions(+), 64 deletions(-)
> 
> diff --git a/arch/s390/kernel/machine_kexec.c 
> b/arch/s390/kernel/machine_kexec.c
> index 2f1b721..1ec6cfc 100644
> --- a/arch/s390/kernel/machine_kexec.c
> +++ b/arch/s390/kernel/machine_kexec.c
> @@ -35,6 +35,52 @@ extern const unsigned long long relocate_kernel_len;
>  #ifdef CONFIG_CRASH_DUMP
> 
>  /*
> + * Map or unmap crashkernel memory
> + */
> +static void crash_map_pages(int enable)
> +{
> + unsigned long size = resource_size(&crashk_res);
> +
> + BUG_ON(crashk_res.start % KEXEC_CRASH_MEM_ALIGN ||
> +size % KEXEC_CRASH_MEM_ALIGN);
> + if (enable)
> + vmem_add_mapping(crashk_res.start, size);
> + else {
> + vmem_remove_mapping(crashk_res.start, size);
> + if (size)
> + os_info_crashkernel_add(crashk_res.start, size);
> + else
> + os_info_crashkernel_add(0, 0);
> + }
> +}
> +
> +/*
> + * Map crashkernel memory
> + */
> +static void crash_map_reserved_pages(void)
> +{
> + crash_map_pages(1);
> +}
> +
> +/*
> + * Unmap crashkernel memory
> + */
> +static void crash_unmap_reserved_pages(void)
> +{
> + crash_map_pages(0);
> +}
> +
> +void arch_kexec_protect_crashkres(void)
> +{
> + crash_unmap_reserved_pages();
> +}
> +
> +void arch_kexec_unpr

Re: [PATCH] kexec: unmap reserved pages for each error-return way

2016-01-28 Thread Michael Holzheu
On Thu, 28 Jan 2016 21:12:54 +0800
Xunlei Pang  wrote:

> On 2016/01/28 at 20:44, Michael Holzheu wrote:
> > On Thu, 28 Jan 2016 19:56:56 +0800
> > Xunlei Pang  wrote:
> >
> >> On 2016/01/28 at 18:32, Michael Holzheu wrote:
> >>> On Wed, 27 Jan 2016 11:15:46 -0800
> >>> Andrew Morton  wrote:
> >>>
> >>>> On Wed, 27 Jan 2016 14:48:31 +0300 Dmitry Safonov 
> >>>>  wrote:
> >>>>
> >>>>> For allocation of kimage failure or kexec_prepare or load segments
> >>>>> errors there is no need to keep crashkernel memory mapped.
> >>>>> It will affect only s390 as map/unmap hook defined only for it.
> >>>>> As on unmap s390 also changes os_info structure let's check return code
> >>>>> and add info only on success.
> >>>>>
> >>>> This conflicts (both mechanically and somewhat conceptually) with
> >>>> Xunlei Pang's "kexec: Introduce a protection mechanism for the
> >>>> crashkernel reserved memory" and "kexec: provide
> >>>> arch_kexec_protect(unprotect)_crashkres()".
> >>>>
> >>>> http://ozlabs.org/~akpm/mmots/broken-out/kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory.patch
> >>>> http://ozlabs.org/~akpm/mmots/broken-out/kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory-v4.patch
> >>>>
> >>>> and
> >>>>
> >>>> http://ozlabs.org/~akpm/mmots/broken-out/kexec-provide-arch_kexec_protectunprotect_crashkres.patch
> >>>> http://ozlabs.org/~akpm/mmots/broken-out/kexec-provide-arch_kexec_protectunprotect_crashkres-v4.patch
> >>> Hmm, It looks to me that arch_kexec_(un)protect_crashkres() has exactly
> >>> the same semantics as crash_(un)map_reserved_pages().
> >>>
> >>> On s390 we don't have the crashkernel memory mapped and therefore need
> >>> crash_map_reserved_pages() before loading something into crashkernel
> >>> memory.
> >> I don't know s390, just curious, if s390 doesn't have crash kernel memory 
> >> mapped,
> >> what's the purpose of the commit(558df7209e)  for s390 as the reserved 
> >> crash memory
> >> with no kernel mapping already means the protection is on?
> > When we reserve crashkernel memory on s390 ("crashkernel=" kernel 
> > parameter),
> > we create a memory hole without page tables.
> >
> > Commit (558df7209e) was necessary to load a kernel/ramdisk into
> > the memory hole with the kexec() system call.
> >
> > We create a temporary mapping with crash_map_reserved_pages(), then
> > copy the kernel/ramdisk and finally remove the mapping again
> > via crash_unmap_reserved_pages().
> 
> Thanks for the explanation.
> So, on s390 the physical memory address has the same value as its kernel 
> virtual address,
> and kmap() actually returns the value of the physical address of the page, 
> right?

Correct. On s390 kmap() always returns the physical address of the page.

We have a 1:1 mapping for all of the physical memory. For this area
virtual=real. In addition to that we have the vmalloc area above
the 1:1 mapping, where some of the memory is mapped a second time.

Michael



Re: [PATCH] kexec: unmap reserved pages for each error-return way

2016-01-28 Thread Michael Holzheu
On Thu, 28 Jan 2016 19:56:56 +0800
Xunlei Pang  wrote:

> On 2016/01/28 at 18:32, Michael Holzheu wrote:
> > On Wed, 27 Jan 2016 11:15:46 -0800
> > Andrew Morton  wrote:
> >
> >> On Wed, 27 Jan 2016 14:48:31 +0300 Dmitry Safonov  
> >> wrote:
> >>
> >>> For allocation of kimage failure or kexec_prepare or load segments
> >>> errors there is no need to keep crashkernel memory mapped.
> >>> It will affect only s390 as map/unmap hook defined only for it.
> >>> As on unmap s390 also changes os_info structure let's check return code
> >>> and add info only on success.
> >>>
> >> This conflicts (both mechanically and somewhat conceptually) with
> >> Xunlei Pang's "kexec: Introduce a protection mechanism for the
> >> crashkernel reserved memory" and "kexec: provide
> >> arch_kexec_protect(unprotect)_crashkres()".
> >>
> >> http://ozlabs.org/~akpm/mmots/broken-out/kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory.patch
> >> http://ozlabs.org/~akpm/mmots/broken-out/kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory-v4.patch
> >>
> >> and
> >>
> >> http://ozlabs.org/~akpm/mmots/broken-out/kexec-provide-arch_kexec_protectunprotect_crashkres.patch
> >> http://ozlabs.org/~akpm/mmots/broken-out/kexec-provide-arch_kexec_protectunprotect_crashkres-v4.patch
> > Hmm, It looks to me that arch_kexec_(un)protect_crashkres() has exactly
> > the same semantics as crash_(un)map_reserved_pages().
> >
> > On s390 we don't have the crashkernel memory mapped and therefore need
> > crash_map_reserved_pages() before loading something into crashkernel
> > memory.
> 
> I don't know s390, just curious, if s390 doesn't have crash kernel memory 
> mapped,
> what's the purpose of the commit(558df7209e)  for s390 as the reserved crash 
> memory
> with no kernel mapping already means the protection is on?

When we reserve crashkernel memory on s390 ("crashkernel=" kernel parameter),
we create a memory hole without page tables.

Commit (558df7209e) was necessary to load a kernel/ramdisk into
the memory hole with the kexec() system call.

We create a temporary mapping with crash_map_reserved_pages(), then
copy the kernel/ramdisk and finally remove the mapping again
via crash_unmap_reserved_pages().

We did that all in order to protect the preloaded kernel and ramdisk.

I forgot the details of why commit 558df7209e wasn't necessary before.
AFAIK it became necessary because of some kdump (mmap?) rework.

Michael



Re: [PATCH] kexec: unmap reserved pages for each error-return way

2016-01-28 Thread Michael Holzheu
On Wed, 27 Jan 2016 11:15:46 -0800
Andrew Morton  wrote:

> On Wed, 27 Jan 2016 14:48:31 +0300 Dmitry Safonov  
> wrote:
> 
> > For allocation of kimage failure or kexec_prepare or load segments
> > errors there is no need to keep crashkernel memory mapped.
> > It will affect only s390 as map/unmap hook defined only for it.
> > As on unmap s390 also changes os_info structure let's check return code
> > and add info only on success.
> > 
> 
> This conflicts (both mechanically and somewhat conceptually) with
> Xunlei Pang's "kexec: Introduce a protection mechanism for the
> crashkernel reserved memory" and "kexec: provide
> arch_kexec_protect(unprotect)_crashkres()".
> 
> http://ozlabs.org/~akpm/mmots/broken-out/kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory.patch
> http://ozlabs.org/~akpm/mmots/broken-out/kexec-introduce-a-protection-mechanism-for-the-crashkernel-reserved-memory-v4.patch
> 
> and
> 
> http://ozlabs.org/~akpm/mmots/broken-out/kexec-provide-arch_kexec_protectunprotect_crashkres.patch
> http://ozlabs.org/~akpm/mmots/broken-out/kexec-provide-arch_kexec_protectunprotect_crashkres-v4.patch

Hmm, It looks to me that arch_kexec_(un)protect_crashkres() has exactly
the same semantics as crash_(un)map_reserved_pages().

On s390 we don't have the crashkernel memory mapped and therefore need
crash_map_reserved_pages() before loading something into crashkernel
memory.

Perhaps I missed something?
Michael



Re: [PATCH] numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390

2016-01-26 Thread Michael Holzheu
On Mon, 25 Jan 2016 14:51:16 -0800
Andrew Morton  wrote:

> On Mon, 25 Jan 2016 17:30:42 +0100 Michael Holzheu 
>  wrote:
> 
> > When working with hugetlbfs ptes (which are actually pmds) it is not
> > valid to directly use pte functions like pte_present() because the
> > hardware bit layout of pmds and ptes can be different. This is the
> > case on s390. Therefore we have to convert the hugetlbfs ptes first
> > into a valid pte encoding with huge_ptep_get().
> > 
> > Currently the /proc/<pid>/numa_maps code uses hugetlbfs ptes without
> > huge_ptep_get(). On s390 this leads to the following two problems:
> > 
> > 1) The pte_present() function returns false (instead of true) for
> >PROT_NONE hugetlb ptes. Therefore PROT_NONE vmas are missing
> >completely in the "numa_maps" output.
> > 
> > 2) The pte_dirty() function always returns false for all hugetlb ptes.
> >Therefore these pages are reported as "mapped=xxx" instead of
> >"dirty=xxx".
> > 
> > Therefore use huge_ptep_get() to correctly convert the hugetlb ptes.
> 
> I'm aiming this at 4.5 only.  Please let me know if you think that a
> -stable backport is warranted.

S390 has NUMA support since kernel 4.3, therefore:

Cc: sta...@vger.kernel.org # v4.3+ 

Michael



[PATCH] numa: fix /proc/<pid>/numa_maps for hugetlbfs on s390

2016-01-25 Thread Michael Holzheu
When working with hugetlbfs ptes (which are actually pmds) it is not
valid to directly use pte functions like pte_present() because the
hardware bit layout of pmds and ptes can be different. This is the
case on s390. Therefore we have to convert the hugetlbfs ptes first
into a valid pte encoding with huge_ptep_get().

Currently the /proc/<pid>/numa_maps code uses hugetlbfs ptes without
huge_ptep_get(). On s390 this leads to the following two problems:

1) The pte_present() function returns false (instead of true) for
   PROT_NONE hugetlb ptes. Therefore PROT_NONE vmas are missing
   completely in the "numa_maps" output.

2) The pte_dirty() function always returns false for all hugetlb ptes.
   Therefore these pages are reported as "mapped=xxx" instead of
   "dirty=xxx".

Therefore use huge_ptep_get() to correctly convert the hugetlb ptes.

Reviewed-by: Gerald Schaefer 
Signed-off-by: Michael Holzheu 
---
 fs/proc/task_mmu.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 85d16c6..4a0c31f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1552,18 +1552,19 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end, struct mm_walk *walk)
 {
+   pte_t huge_pte = huge_ptep_get(pte);
struct numa_maps *md;
struct page *page;
 
-   if (!pte_present(*pte))
+   if (!pte_present(huge_pte))
return 0;
 
-   page = pte_page(*pte);
+   page = pte_page(huge_pte);
if (!page)
return 0;
 
md = walk->private;
-   gather_stats(page, md, pte_dirty(*pte), 1);
+   gather_stats(page, md, pte_dirty(huge_pte), 1);
return 0;
 }
 
-- 
2.3.9



[PATCH] numa: fix /proc/<pid>/numa_maps on s390

2016-01-20 Thread Michael Holzheu
When working with huge page pmds it is in general not valid to directly
use pte functions like pte_present(), because the hardware bit layout
of pmds and ptes can be different. This is the case on s390. Therefore
we have to convert the pmds first into a valid pte encoding with
huge_ptep_get(). So add the two missing function calls to do this.

Reviewed-by: Gerald Schaefer <gerald.schae...@de.ibm.com>
Signed-off-by: Michael Holzheu <holz...@linux.vnet.ibm.com>
---
 fs/proc/task_mmu.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 65a1b6c..e287e32 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1520,7 +1520,7 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
        pte_t *pte;
 
        if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
-               pte_t huge_pte = *(pte_t *)pmd;
+               pte_t huge_pte = huge_ptep_get((pte_t *)pmd);
                struct page *page;
 
                page = can_gather_numa_stats(huge_pte, vma, addr);
@@ -1548,18 +1548,19 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
 static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
                unsigned long addr, unsigned long end, struct mm_walk *walk)
 {
+       pte_t huge_pte = huge_ptep_get(pte);
        struct numa_maps *md;
        struct page *page;
 
-       if (!pte_present(*pte))
+       if (!pte_present(huge_pte))
                return 0;
 
-       page = pte_page(*pte);
+       page = pte_page(huge_pte);
        if (!page)
                return 0;
 
        md = walk->private;
-       gather_stats(page, md, pte_dirty(*pte), 1);
+       gather_stats(page, md, pte_dirty(huge_pte), 1);
        return 0;
 }
 
-- 
2.3.9




Re: [RFC][PATCH] sched: Start stopper early

2015-10-26 Thread Michael Holzheu
On Fri, 16 Oct 2015 14:01:25 +0200
Heiko Carstens <heiko.carst...@de.ibm.com> wrote:

> On Fri, Oct 16, 2015 at 11:57:06AM +0200, Peter Zijlstra wrote:
> > On Fri, Oct 16, 2015 at 10:22:12AM +0200, Heiko Carstens wrote:
> > > So, actually this doesn't fix the bug and it _seems_ to be reproducible.
> > > 
> > > [ FWIW, I will be offline for the next two weeks ]
> > 
> > So the series from Oleg would be good to try; I can make a git tree for
> > you, or otherwise stuff the lot into a single patch.
> > 
> > Should I be talking to someone else whilst you're having down time?
> 
> Yes Michael Holzheu (on cc), can take care of this.

I tested Peter's "tip/master" and "tip/sched/core". With the following
commit our issue seems to be fixed:

2b621a085a ("stop_machine: Change cpu_stop_queue_two_works() to rely
on stopper->enabled")

When do you plan to merge the patch series into the mainline kernel?

Regards,
Michael

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: [RFC PATCH] drivers/base: use cpu->node_id for from_nid

2015-08-06 Thread Michael Holzheu
On Wed, 5 Aug 2015 13:51:21 -0700
Greg Kroah-Hartman <gre...@linuxfoundation.org> wrote:

> On Thu, Jul 30, 2015 at 08:35:51PM +0200, Michael Holzheu wrote:
> > Hello Greg,
> > 
> > Is it possible to use "from_nid = cpu->node_id"?
> > 
> > Background:
> > 
> > I am currently working on (fake) NUMA support for s390. At startup, for
> > "deconfigured" CPUs, we don't know to which nodes the CPUs belong. Therefore
> > we always return node 0 for cpu_to_node().
> > 
> > For each present CPU the register_cpu() function is called which sets an
> > initial NUMA node via cpu_to_node(), which is then first node 0 for
> > "deconfigured" CPUs on s390.
> > 
> > After we "configure" a CPU we know to which node it belongs. Then when 
> > setting
> > a CPU online, the following is done in cpu_subsys_online():
> > 
> >  from_nid = cpu_to_node(cpuid); -> we return node x
> >  cpu_up(cpuid);
> >  to_nid = cpu_to_node(cpuid);   -> we return node x
> >  if (from_nid != to_nid)-> x != x -> false
> >change_cpu_under_node(cpu, from_nid, to_nid);
> > 
> > The result is that each CPU that was deconfigured at boot time stays in
> > node 0 because cpu_to_node() returns the same node before and after
> > setting the CPU online.
> > 
> > Using "cpu->node_id" for "from_nid" instead of calling cpu_to_node()
> > would help in our case.
> > 
> > Signed-off-by: Michael Holzheu <holz...@linux.vnet.ibm.com>
> > ---
> >  drivers/base/cpu.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
> > index f160ea4..2dd889c 100644
> > --- a/drivers/base/cpu.c
> > +++ b/drivers/base/cpu.c
> > @@ -47,7 +47,7 @@ static int __ref cpu_subsys_online(struct device *dev)
> > int from_nid, to_nid;
> > int ret;
> >  
> > -   from_nid = cpu_to_node(cpuid);
> > +   from_nid = cpu->node_id;
> > if (from_nid == NUMA_NO_NODE)
> > return -ENODEV;
> >  
> 
> I really have no idea the answer to any of these questions, sorry...

No problem, in the meantime I found another solution for my problem.

But thanks for trying :-)
Michael




[RFC PATCH] drivers/base: use cpu->node_id for from_nid

2015-07-30 Thread Michael Holzheu
Hello Greg,

Is it possible to use "from_nid = cpu->node_id"?

Background:

I am currently working on (fake) NUMA support for s390. At startup, for
"deconfigured" CPUs, we don't know to which nodes the CPUs belong. Therefore
we always return node 0 for cpu_to_node().

For each present CPU the register_cpu() function is called which sets an
initial NUMA node via cpu_to_node(), which is then first node 0 for
"deconfigured" CPUs on s390.

After we "configure" a CPU we know to which node it belongs. Then when setting
a CPU online, the following is done in cpu_subsys_online():

 from_nid = cpu_to_node(cpuid); -> we return node x
 cpu_up(cpuid);
 to_nid = cpu_to_node(cpuid);   -> we return node x
 if (from_nid != to_nid)-> x != x -> false
   change_cpu_under_node(cpu, from_nid, to_nid);

The result is that each CPU that was deconfigured at boot time stays in
node 0 because cpu_to_node() returns the same node before and after
setting the CPU online.

Using "cpu->node_id" for "from_nid" instead of calling cpu_to_node()
would help in our case.

Signed-off-by: Michael Holzheu <holz...@linux.vnet.ibm.com>
---
 drivers/base/cpu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/base/cpu.c b/drivers/base/cpu.c
index f160ea4..2dd889c 100644
--- a/drivers/base/cpu.c
+++ b/drivers/base/cpu.c
@@ -47,7 +47,7 @@ static int __ref cpu_subsys_online(struct device *dev)
int from_nid, to_nid;
int ret;
 
-   from_nid = cpu_to_node(cpuid);
+   from_nid = cpu->node_id;
if (from_nid == NUMA_NO_NODE)
return -ENODEV;
 
-- 
2.3.8





Re: [PATCH v4] kexec: Make a pair of map and unmap reserved pages when kdump fails to start

2015-07-14 Thread Michael Holzheu
On Tue, 14 Jul 2015 10:09:20 -0400
Vivek Goyal <vgo...@redhat.com> wrote:

> On Fri, Jul 10, 2015 at 11:14:06AM +0200, Michael Holzheu wrote:
> 
> [..]
> > What about the following patch:
> > ---
> > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > index 7a36fdc..7837c4e 100644
> > --- a/kernel/kexec.c
> > +++ b/kernel/kexec.c
> > @@ -1236,10 +1236,68 @@ int kexec_load_disabled;
> >  
> >  static DEFINE_MUTEX(kexec_mutex);
> >  
> > +static int __kexec_load(unsigned long entry, unsigned long nr_segments,
> 

[snip]

> > +
> > +failure_unmap_mem:
> 
> I don't like this tag "failure_unmap_mem". We are calling this both
> in success path as well as failure path. So why not simply call it "out".

Since the code is better readable now, I'm fine with "out" :-)

> 
> > +   if (flags & KEXEC_ON_CRASH)
> > +   crash_unmap_reserved_pages();
> > +   kimage_free(image);
> 
> Now kimage_free() is called with kexec_mutex held. Previously that was
> not the case. I hope that's not a problem.

Yes, I noticed that. But also in the original code there is already
one spot where kimage_free() is called under lock:

/*
 * In case of crash, new kernel gets loaded in reserved region. It is
 * same memory where old crash kernel might be loaded. Free any
 * current crash dump kernel before we corrupt it.
 */
if (flags & KEXEC_FILE_ON_CRASH)
kimage_free(xchg(&kexec_crash_image, NULL));

Therefore I thought it should be ok.

Michael




Re: [PATCH v4] kexec: Make a pair of map and unmap reserved pages when kdump fails to start

2015-07-10 Thread Michael Holzheu
On Fri, 10 Jul 2015 11:14:06 +0200
Michael Holzheu  wrote:

> On Fri, 10 Jul 2015 17:03:22 +0800
> Minfei Huang  wrote:

[snip]

> +static int __kexec_load(unsigned long entry, unsigned long nr_segments,
> + struct kexec_segment __user *segments,
> + unsigned long flags)
> +{
> + struct kimage **dest_image, *image;
> + unsigned long i;
> + int result;
> +
> + if (flags & KEXEC_ON_CRASH)
> + dest_image = &kexec_crash_image;
> + else
> + dest_image = &kexec_image;
> +
> + if (nr_segments == 0) {
> + /* Uninstall image */
> + kfree(xchg(dest_image, NULL));

Sorry, too fast today...
Should be of course not kfree, but:

kimage_free(xchg(dest_image, NULL));

Michael



Re: [PATCH v4] kexec: Make a pair of map and unmap reserved pages when kdump fails to start

2015-07-10 Thread Michael Holzheu
On Fri, 10 Jul 2015 17:03:22 +0800
Minfei Huang  wrote:

> On 07/10/15 at 10:54P, Michael Holzheu wrote:
> > On Fri, 10 Jul 2015 13:12:17 +0800
> > Minfei Huang  wrote:
> > 
> > > For some arch, kexec shall map the reserved pages, then use them, when
> > > we try to start the kdump service.
> > > 
> > > Now kexec will never unmap the reserved pages, once it fails to continue
> > > starting the kdump service. So we make a pair of map/unmap reserved
> > > pages whatever kexec fails or not in code path.
> > > 
> > > In order to make code readable, wrap a new function __kexec_load which
> > > contains all of the logic to deal with the image loading.
> > > 
> > > Signed-off-by: Minfei Huang 
> > > ---
> > > v3:
> > > - reconstruct the patch, wrap a new function to deal with the code logic, 
> > > based on Vivek and Michael's patch
> > > v2:
> > > - replace the "failure" label with "fail_unmap_pages"
> > > v1:
> > > - reconstruct the patch code
> > > ---
> > >  kernel/kexec.c | 112 
> > > -
> > >  1 file changed, 63 insertions(+), 49 deletions(-)
> > > 
> > > diff --git a/kernel/kexec.c b/kernel/kexec.c
> > > index a785c10..2232c90 100644
> > > --- a/kernel/kexec.c
> > > +++ b/kernel/kexec.c
> > > @@ -1247,10 +1247,71 @@ int kexec_load_disabled;
> > > 
> > >  static DEFINE_MUTEX(kexec_mutex);
> > > 
> > > +static int __kexec_load(unsigned long entry, unsigned long nr_segments,
> > > + struct kexec_segment __user *segments,
> > > + unsigned long flags)
> > > +{
> > > + int result = 0;
> > > + struct kimage **dest_image, *image;
> > > +
> > > + dest_image = &kexec_image;
> > > +
> > > + if (flags & KEXEC_ON_CRASH)
> > > + dest_image = &kexec_crash_image;
> > > +
> > > + if (nr_segments == 0) {
> > > + /* Install the new kernel, and  Uninstall the old */
> > > + image = xchg(dest_image, image);
> > > + kimage_free(image);
> > 
> > Well this is wrong and should probably be:
> > 
> > if (nr_segments == 0) {
> > /* Uninstall image */
> > image = xchg(dest_image, NULL);
> > kimage_free(image);
> > 
> 
> You are right. It should be what you commented.

And after rethinking a bit, I think a one liner and an early exit
would be better in this case:

 if (nr_segments == 0) {
 /* Uninstall image */
 kimage_free(xchg(dest_image, NULL));
 return 0;
 }

What about the following patch:
---
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 7a36fdc..7837c4e 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1236,10 +1236,68 @@ int kexec_load_disabled;
 
 static DEFINE_MUTEX(kexec_mutex);
 
+static int __kexec_load(unsigned long entry, unsigned long nr_segments,
+   struct kexec_segment __user *segments,
+   unsigned long flags)
+{
+   struct kimage **dest_image, *image;
+   unsigned long i;
+   int result;
+
+   if (flags & KEXEC_ON_CRASH)
+   dest_image = &kexec_crash_image;
+   else
+   dest_image = &kexec_image;
+
+   if (nr_segments == 0) {
+   /* Uninstall image */
+   kfree(xchg(dest_image, NULL));
+   return 0;
+   }
+   if (flags & KEXEC_ON_CRASH) {
+   /*
+* Loading another kernel to switch to if this one
+* crashes.  Free any current crash dump kernel before
+* we corrupt it.
+*/
+   kimage_free(xchg(&kexec_crash_image, NULL));
+   }
+
+   result = kimage_alloc_init(&image, entry, nr_segments, segments, flags);
+   if (result)
+   return result;
+
+   if (flags & KEXEC_ON_CRASH)
+   crash_map_reserved_pages();
+
+   if (flags & KEXEC_PRESERVE_CONTEXT)
+   image->preserve_context = 1;
+
+   result = machine_kexec_prepare(image);
+   if (result)
+   goto failure_unmap_mem;
+
+   for (i = 0; i < nr_segments; i++) {
+   result = kimage_load_segment(image, &image->segment[i]);
+   if (result)
+   goto failure_unmap_mem;
+   }
+
+   kimage_terminate(image);
+
+   /* Install the new kernel and uninstall the old */
+   image = xchg(dest_image, image);
+
+failure_unmap_mem:
+   if (flags & KEXEC_ON_CRASH)
+ 

Re: [PATCH v4] kexec: Make a pair of map and unmap reserved pages when kdump fails to start

2015-07-10 Thread Michael Holzheu
On Fri, 10 Jul 2015 13:12:17 +0800
Minfei Huang  wrote:

> For some arch, kexec shall map the reserved pages, then use them, when
> we try to start the kdump service.
> 
> Now kexec will never unmap the reserved pages, once it fails to continue
> starting the kdump service. So we make a pair of map/unmap reserved
> pages whatever kexec fails or not in code path.
> 
> In order to make code readable, wrap a new function __kexec_load which
> contains all of the logic to deal with the image loading.
> 
> Signed-off-by: Minfei Huang 
> ---
> v3:
> - reconstruct the patch, wrap a new function to deal with the code logic, 
> based on Vivek and Michael's patch
> v2:
> - replace the "failure" label with "fail_unmap_pages"
> v1:
> - reconstruct the patch code
> ---
>  kernel/kexec.c | 112 
> -
>  1 file changed, 63 insertions(+), 49 deletions(-)
> 
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index a785c10..2232c90 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -1247,10 +1247,71 @@ int kexec_load_disabled;
> 
>  static DEFINE_MUTEX(kexec_mutex);
> 
> +static int __kexec_load(unsigned long entry, unsigned long nr_segments,
> + struct kexec_segment __user *segments,
> + unsigned long flags)
> +{
> + int result = 0;
> + struct kimage **dest_image, *image;
> +
> + dest_image = &kexec_image;
> +
> + if (flags & KEXEC_ON_CRASH)
> + dest_image = &kexec_crash_image;
> +
> + if (nr_segments == 0) {
> + /* Install the new kernel, and  Uninstall the old */
> + image = xchg(dest_image, image);
> + kimage_free(image);

Well this is wrong and should probably be:

if (nr_segments == 0) {
/* Uninstall image */
image = xchg(dest_image, NULL);
kimage_free(image);

> + } else {
> + unsigned long i;
> +
> + if (flags & KEXEC_ON_CRASH) {
> + /*

[snip]

> + result = kimage_load_segment(image, &image->segment[i]);
> + if (result)
> + goto failure_unmap_mem;
> + }
> +
> + kimage_terminate(image);
> +
> + /* Install the new kernel, and  Uninstall the old */

Perhaps fix the comment: Remove superfluous blank and lowercase "uninstall"?

> + image = xchg(dest_image, image);
> +
> +failure_unmap_mem:
> + if (flags & KEXEC_ON_CRASH)
> + crash_unmap_reserved_pages();
> + kimage_free(image);

Here the update patch:
---
diff --git a/kernel/kexec.c b/kernel/kexec.c
index e686a39..2f5b4aa 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1249,8 +1249,8 @@ static int __kexec_load(unsigned long entry, unsigned 
long nr_segments,
dest_image = _crash_image;
 
if (nr_segments == 0) {
-   /* Install the new kernel, and  Uninstall the old */
-   image = xchg(dest_image, image);
+   /* Uninstall image */
+   image = xchg(dest_image, NULL);
kimage_free(image);
} else {
unsigned long i;
@@ -1287,7 +1287,7 @@ static int __kexec_load(unsigned long entry, unsigned 
long nr_segments,
 
kimage_terminate(image);
 
-   /* Install the new kernel, and  Uninstall the old */
+   /* Install the new kernel, and uninstall the old */
image = xchg(dest_image, image);
 
 failure_unmap_mem:



Re: [PATCH v3] kexec: Make a pair of map and unmap reserved pages when kdump fails to start

2015-07-10 Thread Michael Holzheu
On Fri, 10 Jul 2015 12:05:27 +0800
Minfei Huang <mnfhu...@gmail.com> wrote:

> On 07/09/15 at 05:54P, Michael Holzheu wrote:
> > On Tue, 7 Jul 2015 17:18:40 -0400
> > Vivek Goyal <vgo...@redhat.com> wrote:
> > 
> > > On Thu, Jul 02, 2015 at 09:45:52AM +0800, Minfei Huang wrote:
> > 
> > [snip]
> > 
> > > I am thinking of moving kernel loading code in a separate function to
> > > make things little simpler. Right now it is confusing.
> > > 
> > > Can you please test attached patch. I have only compile tested it. This
> > > is primarily doing what you are doing but in a separate function. It
> > > seems more readable now.
> > 
> > The patch looks good to me. What about the following patch on top
> > to make things even more readable?
> > ---
> >  kernel/kexec.c |   50 +-
> >  1 file changed, 17 insertions(+), 33 deletions(-)
> > 
> > --- a/kernel/kexec.c
> > +++ b/kernel/kexec.c
> > @@ -1236,14 +1236,18 @@ int kexec_load_disabled;
> >  
> >  static DEFINE_MUTEX(kexec_mutex);
> >  
> > -static int __kexec_load(struct kimage **rimage, unsigned long entry,
> > -   unsigned long nr_segments,
> > +static int __kexec_load(unsigned long entry, unsigned long nr_segments,
> > struct kexec_segment __user * segments,
> > unsigned long flags)
> >  {
> > +   struct kimage *image, **dest_image;
> > unsigned long i;
> > int result;
> > -   struct kimage *image;
> > +
> > +   dest_image = (flags & KEXEC_ON_CRASH) ? &kexec_crash_image : &kexec_image;
> > +
> > +   if (nr_segments == 0)
> > +   return 0;
> 
> It is fine, if nr_segments is 0. So we should deal with this case like
> original kexec code.
> 
> >  
> > if (flags & KEXEC_ON_CRASH) {
> > /*
> > @@ -1251,7 +1255,6 @@ static int __kexec_load(struct kimage **
> >  * crashes.  Free any current crash dump kernel before
> >  * we corrupt it.
> >  */
> > -
> > kimage_free(xchg(&kexec_crash_image, NULL));
> > }
> >  
> > @@ -1267,30 +1270,29 @@ static int __kexec_load(struct kimage **
> >  
> > result = machine_kexec_prepare(image);
> > if (result)
> > -   goto out;
> > +   goto fail;
> >  
> > for (i = 0; i < nr_segments; i++) {
> > result = kimage_load_segment(image, &image->segment[i]);
> > if (result)
> > -   goto out;
> > +   goto fail;
> > }
> > -
> > kimage_terminate(image);
> > -   *rimage = image;
> > -out:
> > +   /* Install the new kernel, and  uninstall the old */
> > +   kimage_free(xchg(dest_image, image));
> > if (flags & KEXEC_ON_CRASH)
> > crash_unmap_reserved_pages();
> > -
> > -   /* Free image if there was an error */
> > -   if (result)
> > -   kimage_free(image);
> > +   return 0;
> > +fail:
> > +   if (flags & KEXEC_ON_CRASH)
> > +   crash_unmap_reserved_pages();
> > +   kimage_free(image);
> 
> Kernel release image again

Again? This is only done in the error case.

> , and will crash in here, since we do not
> assign the image to NULL when we release the image above.

Good catch, I should have set image=NULL at the beginning of __kexec_load().

Michael



Re: [PATCH v4] kexec: Make a pair of map and unmap reserved pages when kdump fails to start

2015-07-10 Thread Michael Holzheu
On Fri, 10 Jul 2015 13:12:17 +0800
Minfei Huang <mnfhu...@gmail.com> wrote:

 For some arch, kexec shall map the reserved pages, then use them, when
 we try to start the kdump service.
 
 Now kexec will never unmap the reserved pages, once it fails to continue
 starting the kdump service. So we make a pair of map/unmap reserved
 pages whatever kexec fails or not in code path.
 
 In order to make code readable, wrap a new function __kexec_load which
 contains all of the logic to deal with the image loading.
 
 Signed-off-by: Minfei Huang <mnfhu...@gmail.com>
 ---
 v3:
 - reconstruct the patch, wrap a new function to deal with the code logic, 
 based on Vivek and Michael's patch
 v2:
 - replace the failure label with fail_unmap_pages
 v1:
 - reconstruct the patch code
 ---
  kernel/kexec.c | 112 
 -
  1 file changed, 63 insertions(+), 49 deletions(-)
 
 diff --git a/kernel/kexec.c b/kernel/kexec.c
 index a785c10..2232c90 100644
 --- a/kernel/kexec.c
 +++ b/kernel/kexec.c
 @@ -1247,10 +1247,71 @@ int kexec_load_disabled;
 
  static DEFINE_MUTEX(kexec_mutex);
 
 +static int __kexec_load(unsigned long entry, unsigned long nr_segments,
 + struct kexec_segment __user *segments,
 + unsigned long flags)
 +{
 + int result = 0;
 + struct kimage **dest_image, *image;
 +
 + dest_image = &kexec_image;
 +
 + if (flags & KEXEC_ON_CRASH)
 + dest_image = &kexec_crash_image;
 +
 + if (nr_segments == 0) {
 + /* Install the new kernel, and  Uninstall the old */
 + image = xchg(dest_image, image);
 + kimage_free(image);

Well this is wrong and should probably be:

if (nr_segments == 0) {
/* Uninstall image */
image = xchg(dest_image, NULL);
kimage_free(image);

 + } else {
 + unsigned long i;
 +
 + if (flags & KEXEC_ON_CRASH) {
 + /*

[snip]

 + result = kimage_load_segment(image, &image->segment[i]);
 + if (result)
 + goto failure_unmap_mem;
 + }
 +
 + kimage_terminate(image);
 +
 + /* Install the new kernel, and  Uninstall the old */

Perhaps fix the comment: Remove superfluous blank and lowercase uninstall?

 + image = xchg(dest_image, image);
 +
 +failure_unmap_mem:
 + if (flags & KEXEC_ON_CRASH)
 + crash_unmap_reserved_pages();
 + kimage_free(image);

Here the update patch:
---
diff --git a/kernel/kexec.c b/kernel/kexec.c
index e686a39..2f5b4aa 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1249,8 +1249,8 @@ static int __kexec_load(unsigned long entry, unsigned 
long nr_segments,
dest_image = kexec_crash_image;
 
if (nr_segments == 0) {
-   /* Install the new kernel, and  Uninstall the old */
-   image = xchg(dest_image, image);
+   /* Uninstall image */
+   image = xchg(dest_image, NULL);
kimage_free(image);
} else {
unsigned long i;
@@ -1287,7 +1287,7 @@ static int __kexec_load(unsigned long entry, unsigned 
long nr_segments,
 
kimage_terminate(image);
 
-   /* Install the new kernel, and  Uninstall the old */
+   /* Install the new kernel, and uninstall the old */
image = xchg(dest_image, image);
 
 failure_unmap_mem:

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v4] kexec: Make a pair of map and unmap reserved pages when kdump fails to start

2015-07-10 Thread Michael Holzheu
On Fri, 10 Jul 2015 17:03:22 +0800
Minfei Huang <mnfhu...@gmail.com> wrote:

 On 07/10/15 at 10:54P, Michael Holzheu wrote:
  On Fri, 10 Jul 2015 13:12:17 +0800
  Minfei Huang <mnfhu...@gmail.com> wrote:
  
   For some arch, kexec shall map the reserved pages, then use them, when
   we try to start the kdump service.
   
   Now kexec will never unmap the reserved pages, once it fails to continue
   starting the kdump service. So we make a pair of map/unmap reserved
   pages whatever kexec fails or not in code path.
   
   In order to make code readable, wrap a new function __kexec_load which
   contains all of the logic to deal with the image loading.
   
  Signed-off-by: Minfei Huang <mnfhu...@gmail.com>
   ---
   v3:
   - reconstruct the patch, wrap a new function to deal with the code logic, 
   based on Vivek and Michael's patch
   v2:
   - replace the failure label with fail_unmap_pages
   v1:
   - reconstruct the patch code
   ---
kernel/kexec.c | 112 
   -
1 file changed, 63 insertions(+), 49 deletions(-)
   
   diff --git a/kernel/kexec.c b/kernel/kexec.c
   index a785c10..2232c90 100644
   --- a/kernel/kexec.c
   +++ b/kernel/kexec.c
   @@ -1247,10 +1247,71 @@ int kexec_load_disabled;
   
static DEFINE_MUTEX(kexec_mutex);
   
   +static int __kexec_load(unsigned long entry, unsigned long nr_segments,
   + struct kexec_segment __user *segments,
   + unsigned long flags)
   +{
   + int result = 0;
   + struct kimage **dest_image, *image;
   +
    + dest_image = &kexec_image;
    +
    + if (flags & KEXEC_ON_CRASH)
    + dest_image = &kexec_crash_image;
   +
   + if (nr_segments == 0) {
   + /* Install the new kernel, and  Uninstall the old */
   + image = xchg(dest_image, image);
   + kimage_free(image);
  
  Well this is wrong and should probably be:
  
  if (nr_segments == 0) {
  /* Uninstall image */
  image = xchg(dest_image, NULL);
  kimage_free(image);
  
 
 You are right. It should be what you commented.

And after rethinking a bit, I think a one liner and an early exit
would be better in this case:

 if (nr_segments == 0) {
 /* Uninstall image */
 kimage_free(xchg(dest_image, NULL));
 return 0;
 }

What about the following patch:
---
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 7a36fdc..7837c4e 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1236,10 +1236,68 @@ int kexec_load_disabled;
 
 static DEFINE_MUTEX(kexec_mutex);
 
+static int __kexec_load(unsigned long entry, unsigned long nr_segments,
+   struct kexec_segment __user *segments,
+   unsigned long flags)
+{
+   struct kimage **dest_image, *image;
+   unsigned long i;
+   int result;
+
+   if (flags & KEXEC_ON_CRASH)
+   dest_image = &kexec_crash_image;
+   else
+   dest_image = &kexec_image;
+
+   if (nr_segments == 0) {
+   /* Uninstall image */
+   kfree(xchg(dest_image, NULL));
+   return 0;
+   }
+   if (flags & KEXEC_ON_CRASH) {
+   /*
+* Loading another kernel to switch to if this one
+* crashes.  Free any current crash dump kernel before
+* we corrupt it.
+*/
+   kimage_free(xchg(&kexec_crash_image, NULL));
+   }
+
+   result = kimage_alloc_init(&image, entry, nr_segments, segments, flags);
+   if (result)
+   return result;
+
+   if (flags & KEXEC_ON_CRASH)
+   crash_map_reserved_pages();
+
+   if (flags & KEXEC_PRESERVE_CONTEXT)
+   image->preserve_context = 1;
+
+   result = machine_kexec_prepare(image);
+   if (result)
+   goto failure_unmap_mem;
+
+   for (i = 0; i < nr_segments; i++) {
+   result = kimage_load_segment(image, &image->segment[i]);
+   if (result)
+   goto failure_unmap_mem;
+   }
+
+   kimage_terminate(image);
+
+   /* Install the new kernel and uninstall the old */
+   image = xchg(dest_image, image);
+
+failure_unmap_mem:
+   if (flags & KEXEC_ON_CRASH)
+   crash_unmap_reserved_pages();
+   kimage_free(image);
+   return result;
+}
+
 SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
struct kexec_segment __user *, segments, unsigned long, flags)
 {
-   struct kimage **dest_image, *image;
int result;
 
/* We only trust the superuser with rebooting the system. */
@@ -1264,9 +1322,6 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, 
unsigned long, nr_segments,
if (nr_segments > KEXEC_SEGMENT_MAX)
return -EINVAL;
 
-   image = NULL;
-   result = 0;
-
/* Because we write directly

Re: [PATCH v4] kexec: Make a pair of map and unmap reserved pages when kdump fails to start

2015-07-10 Thread Michael Holzheu
On Fri, 10 Jul 2015 11:14:06 +0200
Michael Holzheu <holz...@linux.vnet.ibm.com> wrote:

 On Fri, 10 Jul 2015 17:03:22 +0800
 Minfei Huang <mnfhu...@gmail.com> wrote:

[snip]

 +static int __kexec_load(unsigned long entry, unsigned long nr_segments,
 + struct kexec_segment __user *segments,
 + unsigned long flags)
 +{
 + struct kimage **dest_image, *image;
 + unsigned long i;
 + int result;
 +
 + if (flags & KEXEC_ON_CRASH)
 + dest_image = kexec_crash_image;
 + else
 + dest_image = kexec_image;
 +
 + if (nr_segments == 0) {
 + /* Uninstall image */
 + kfree(xchg(dest_image, NULL));

Sorry, too fast today...
Should be of course not kfree, but:

kimage_free(xchg(dest_image, NULL));

Michael

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3] kexec: Make a pair of map and unmap reserved pages when kdump fails to start

2015-07-09 Thread Michael Holzheu
On Tue, 7 Jul 2015 17:18:40 -0400
Vivek Goyal <vgo...@redhat.com> wrote:

> On Thu, Jul 02, 2015 at 09:45:52AM +0800, Minfei Huang wrote:

[snip]

> I am thinking of moving kernel loading code in a separate function to
> make things little simpler. Right now it is confusing.
> 
> Can you please test attached patch. I have only compile tested it. This
> is primarily doing what you are doing but in a separate function. It
> seems more readable now.

The patch looks good to me. What about the following patch on top
to make things even more readable?
---
 kernel/kexec.c |   50 +-
 1 file changed, 17 insertions(+), 33 deletions(-)

--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1236,14 +1236,18 @@ int kexec_load_disabled;
 
 static DEFINE_MUTEX(kexec_mutex);
 
-static int __kexec_load(struct kimage **rimage, unsigned long entry,
-   unsigned long nr_segments,
+static int __kexec_load(unsigned long entry, unsigned long nr_segments,
struct kexec_segment __user * segments,
unsigned long flags)
 {
+   struct kimage *image, **dest_image;
unsigned long i;
int result;
-   struct kimage *image;
+
+   dest_image = (flags & KEXEC_ON_CRASH) ? &kexec_crash_image : 
&kexec_image;
+
+   if (nr_segments == 0)
+   return 0;
 
if (flags & KEXEC_ON_CRASH) {
/*
@@ -1251,7 +1255,6 @@ static int __kexec_load(struct kimage **
 * crashes.  Free any current crash dump kernel before
 * we corrupt it.
 */
-
kimage_free(xchg(&kexec_crash_image, NULL));
}
 
@@ -1267,30 +1270,29 @@ static int __kexec_load(struct kimage **
 
result = machine_kexec_prepare(image);
if (result)
-   goto out;
+   goto fail;
 
for (i = 0; i < nr_segments; i++) {
result = kimage_load_segment(image, &image->segment[i]);
if (result)
-   goto out;
+   goto fail;
}
-
kimage_terminate(image);
-   *rimage = image;
-out:
+   /* Install the new kernel, and  uninstall the old */
+   kimage_free(xchg(dest_image, image));
if (flags & KEXEC_ON_CRASH)
crash_unmap_reserved_pages();
-
-   /* Free image if there was an error */
-   if (result)
-   kimage_free(image);
+   return 0;
+fail:
+   if (flags & KEXEC_ON_CRASH)
+   crash_unmap_reserved_pages();
+   kimage_free(image);
return result;
 }
 
 SYSCALL_DEFINE4(kexec_load, unsigned long, entry, unsigned long, nr_segments,
struct kexec_segment __user *, segments, unsigned long, flags)
 {
-   struct kimage **dest_image, *image;
int result;
 
/* We only trust the superuser with rebooting the system. */
@@ -1315,9 +1317,6 @@ SYSCALL_DEFINE4(kexec_load, unsigned lon
if (nr_segments > KEXEC_SEGMENT_MAX)
return -EINVAL;
 
-   image = NULL;
-   result = 0;
-
/* Because we write directly to the reserved memory
 * region when loading crash kernels we need a mutex here to
 * prevent multiple crash  kernels from attempting to load
@@ -1329,24 +1328,9 @@ SYSCALL_DEFINE4(kexec_load, unsigned lon
if (!mutex_trylock(&kexec_mutex))
return -EBUSY;
 
-   dest_image = &kexec_image;
-   if (flags & KEXEC_ON_CRASH)
-   dest_image = &kexec_crash_image;
-
/* Load new kernel */
-   if (nr_segments > 0) {
-   result = __kexec_load(&image, entry, nr_segments, segments,
- flags);
-   if (result)
-   goto out;
-   }
-
-   /* Install the new kernel, and  Uninstall the old */
-   image = xchg(dest_image, image);
-
-out:
+   result = __kexec_load(entry, nr_segments, segments, flags);
mutex_unlock(&kexec_mutex);
-   kimage_free(image);
 
return result;
 }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] kexec: Make a pair of map and unmap reserved pages when kdump fails to start

2015-07-01 Thread Michael Holzheu
Hello Minfei,

Regarding functionality your patch looks ok for me.
But the code is not easy to read.

What about replacing the "failure" label with "fail_unmap_pages"?

Michael

On Tue, 30 Jun 2015 13:44:46 +0800
Minfei Huang <mnfhu...@gmail.com> wrote:

> For some arch, kexec shall map the reserved pages, then use them, when
> we try to start the kdump service.
> 
> Now kexec will never unmap the reserved pages, once it fails to continue
> starting the kdump service.
> 
> Make a pair of reserved pages in kdump starting path, whatever kexec
> fails or not.
> 
> Signed-off-by: Minfei Huang <mnfhu...@gmail.com>
> ---
>  kernel/kexec.c | 26 ++
>  1 file changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index 4589899..68f6dfb 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -1291,35 +1291,37 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, 
> unsigned long, nr_segments,
>*/
> 
>   kimage_free(xchg(&kexec_crash_image, NULL));
> - result = kimage_alloc_init(&image, entry, nr_segments,
> -segments, flags);
> - crash_map_reserved_pages();
> - } else {
> - /* Loading another kernel to reboot into. */
> -
> - result = kimage_alloc_init(&image, entry, nr_segments,
> -segments, flags);
>   }
> +
> + result = kimage_alloc_init(&image, entry, nr_segments,
> + segments, flags);
>   if (result)
>   goto out;
> 
> + if (flags & KEXEC_ON_CRASH)
> + crash_map_reserved_pages();
> +
>   if (flags & KEXEC_PRESERVE_CONTEXT)
>   image->preserve_context = 1;
>   result = machine_kexec_prepare(image);
>   if (result)
> - goto out;
> + goto failure;
> 
>   for (i = 0; i < nr_segments; i++) {
>   result = kimage_load_segment(image, &image->segment[i]);
>   if (result)
> - goto out;
> + goto failure;
>   }
>   kimage_terminate(image);
> +
> +failure:
>   if (flags & KEXEC_ON_CRASH)
>   crash_unmap_reserved_pages();
>   }
> - /* Install the new kernel, and  Uninstall the old */
> - image = xchg(dest_image, image);
> +
> + if (result == 0)
> + /* Install the new kernel, and  Uninstall the old */
> + image = xchg(dest_image, image);
> 
>  out:
>   mutex_unlock(&kexec_mutex);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH RFC] s390/sclp: pass timeout as HZ independent value

2015-05-29 Thread Michael Holzheu
On Fri, 29 May 2015 13:49:36 +0200
Nicholas Mc Guire <der.h...@hofr.at> wrote:

> On Fri, 29 May 2015, Heiko Carstens wrote:
> 
> > On Fri, May 29, 2015 at 11:51:54AM +0200, Nicholas Mc Guire wrote:
> > > On Fri, 29 May 2015, Heiko Carstens wrote:
> > > > Yes, the orginal code seems to be broken. Since I've no idea what the 
> > > > intended
> > > > timeout value should be, let's simply ask Michael, who wrote this code 
> > > > eight
> > > > years ago ;)
> > > > While these lines get touched anyway, it would make sense to use
> > > > schedule_timeout_interruptible() instead, and get rid of 
> > > > set_current_state().
> > > >
> > > Well that is not really equivalent
> > > schedule_timeout_interruptible() is doing
> > > __set_current_state not set_current_state
> > > so that would drop the mb() and no WRITE_ONCE()
> > 
> > And how does that matter in this case?
> >
> I do not know - did not look into it - in any case
> its not a 1:1 API consolidation that all I wanted to point out
> before changing anything.

I agree, 1:1 consolidation is better here.

But I would like to remove the SDIAS_SLEEP_TICKS define and just
use HZ / 2 in schedule_timeout(). Could you please resend the
updated patch? We will then add it to our tree.

Thanks
Michael

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] samples/bpf: Fix test_maps/bpf_get_next_key() test

2015-01-23 Thread Michael Holzheu
On Thu, 22 Jan 2015 09:32:43 -0800
Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote:

> On Thu, Jan 22, 2015 at 8:01 AM, Michael Holzheu
> <holz...@linux.vnet.ibm.com> wrote:
> > Looks like the "test_maps" test case expects to get the keys in
> > the wrong order when iterating over the elements:
> >
> > test_maps: samples/bpf/test_maps.c:79: test_hashmap_sanity: Assertion
> > `bpf_get_next_key(map_fd, &key, &next_key) == 0 && next_key == 2' failed.
> > Aborted
> >
> > Fix this and test for the correct order.
> 
> that will break this test on x86...
> we need to understand first why the order of two elements
> came out different on s390...
> Could it be that jhash() produced different hash for the same
> values on x86 vs s390 ?

Yes I think jhash() produces different results for input > 12 bytes
on big and little endian machines because of the following code
in include/linux/jhash.h:

while (length > 12) {
a += __get_unaligned_cpu32(k);
b += __get_unaligned_cpu32(k + 4);
c += __get_unaligned_cpu32(k + 8);
__jhash_mix(a, b, c);
length -= 12;
k += 12;
}

The contents of "k" is directly used as u32 and the result
of "__get_unaligned_cpu32(k)" is different for big and
little endian.

Michael

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] bpf: Call rcu_read_unlock() before copy_to_user()

2015-01-22 Thread Michael Holzheu
On Thu, 22 Jan 2015 09:27:21 -0800
Alexei Starovoitov <alexei.starovoi...@gmail.com> wrote:

> On Thu, Jan 22, 2015 at 7:57 AM, Michael Holzheu
> <holz...@linux.vnet.ibm.com> wrote:
> > We must not hold locks when calling copy_to_user():
> >
> > BUG: sleeping function called from invalid context at mm/memory.c:3732
> > in_atomic(): 0, irqs_disabled(): 0, pid: 671, name: test_maps
> > 1 lock held by test_maps/671:
> >  #0:  (rcu_read_lock){..}, at: [<00264190>] 
> > map_lookup_elem+0xe8/0x260
> > Preemption disabled at:[<001be3b6>] vprintk_default+0x56/0x68
> >
> > CPU: 0 PID: 671 Comm: test_maps Not tainted 3.19.0-rc5-00117-g5eb11d6-dirty 
> > #424
> >1e447bb0 1e447c40 0002 
> >1e447ce0 1e447c58 1e447c58 00115c8a
> > 00c08246 00c27e8a 000b
> >1e447ca0 1e447c40  
> > 00115c8a 1e447c40 1e447ca0
> > Call Trace:
> > ([<00115b7e>] show_trace+0x12e/0x150)
> >  [<00115c40>] show_stack+0xa0/0x100
> >  [<009b163c>] dump_stack+0x74/0xc8
> >  [<0017424a>] ___might_sleep+0x23a/0x248
> >  [<002b58e8>] might_fault+0x70/0xe8
> >  [<00264230>] map_lookup_elem+0x188/0x260
> >  [<00264716>] SyS_bpf+0x20e/0x840
> >  [<009bbe3a>] system_call+0xd6/0x24c
> >  [<03fffd15f566>] 0x3fffd15f566
> > 1 lock held by test_maps/671:
> >  #0:  (rcu_read_lock){..}, at: [<00264190>] 
> > map_lookup_elem+0xe8/0x260
> >
> > So call rcu_read_unlock() before copy_to_user(). We can
> > release the lock earlier because it is not needed for copy_to_user().
> 
> we cannot move the rcu unlock this way, since it protects the value.
> So we need to copy the value while still under rcu.

Ok, right. I assume you will provide the correct fix.

> I'm puzzled how I missed this warning.
> I guess you have CONFIG_PREEMPT_RCU=y ?

Yes.

Regards,
Michael

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] samples/bpf: Fix test_maps/bpf_get_next_key() test

2015-01-22 Thread Michael Holzheu
Looks like the "test_maps" test case expects to get the keys in
the wrong order when iterating over the elements:

test_maps: samples/bpf/test_maps.c:79: test_hashmap_sanity: Assertion
`bpf_get_next_key(map_fd, &key, &next_key) == 0 && next_key == 2' failed.
Aborted

Fix this and test for the correct order.  

Signed-off-by: Michael Holzheu <holz...@linux.vnet.ibm.com>
---
 samples/bpf/test_maps.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/samples/bpf/test_maps.c
+++ b/samples/bpf/test_maps.c
@@ -69,9 +69,9 @@ static void test_hashmap_sanity(int i, v
 
/* iterate over two elements */
	assert(bpf_get_next_key(map_fd, &key, &next_key) == 0 &&
-  next_key == 2);
-   assert(bpf_get_next_key(map_fd, &next_key, &next_key) == 0 &&
   next_key == 1);
+   assert(bpf_get_next_key(map_fd, &next_key, &next_key) == 0 &&
+  next_key == 2);
	assert(bpf_get_next_key(map_fd, &next_key, &next_key) == -1 &&
   errno == ENOENT);
 



[PATCH] bpf: Call rcu_read_unlock() before copy_to_user()

2015-01-22 Thread Michael Holzheu
We must not hold locks when calling copy_to_user():

BUG: sleeping function called from invalid context at mm/memory.c:3732
in_atomic(): 0, irqs_disabled(): 0, pid: 671, name: test_maps
1 lock held by test_maps/671:
 #0:  (rcu_read_lock){..}, at: [<00264190>] map_lookup_elem+0xe8/0x260
Preemption disabled at:[<001be3b6>] vprintk_default+0x56/0x68

CPU: 0 PID: 671 Comm: test_maps Not tainted 3.19.0-rc5-00117-g5eb11d6-dirty #424
   1e447bb0 1e447c40 0002  
   1e447ce0 1e447c58 1e447c58 00115c8a 
    00c08246 00c27e8a 000b 
   1e447ca0 1e447c40   
    00115c8a 1e447c40 1e447ca0 
Call Trace:
([<00115b7e>] show_trace+0x12e/0x150)
 [<00115c40>] show_stack+0xa0/0x100
 [<009b163c>] dump_stack+0x74/0xc8
 [<0017424a>] ___might_sleep+0x23a/0x248
 [<002b58e8>] might_fault+0x70/0xe8
 [<00264230>] map_lookup_elem+0x188/0x260
 [<00264716>] SyS_bpf+0x20e/0x840
 [<009bbe3a>] system_call+0xd6/0x24c
 [<03fffd15f566>] 0x3fffd15f566
1 lock held by test_maps/671:
 #0:  (rcu_read_lock){..}, at: [<00264190>] map_lookup_elem+0xe8/0x260

So call rcu_read_unlock() before copy_to_user(). We can
release the lock earlier because it is not needed for copy_to_user().

Signed-off-by: Michael Holzheu 
---
 kernel/bpf/syscall.c |7 +++
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -172,17 +172,16 @@ static int map_lookup_elem(union bpf_att
err = -ENOENT;
rcu_read_lock();
value = map->ops->map_lookup_elem(map, key);
+   rcu_read_unlock();
if (!value)
-   goto err_unlock;
+   goto free_key;
 
err = -EFAULT;
if (copy_to_user(uvalue, value, map->value_size) != 0)
-   goto err_unlock;
+   goto free_key;
 
err = 0;
 
-err_unlock:
-   rcu_read_unlock();
 free_key:
kfree(key);
 err_put:



Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading

2013-11-25 Thread Michael Holzheu
On Mon, 25 Nov 2013 10:36:20 -0500
Vivek Goyal  wrote:

> On Mon, Nov 25, 2013 at 11:04:28AM +0100, Michael Holzheu wrote:
> > On Fri, 22 Nov 2013 05:34:03 -0800
> > ebied...@xmission.com (Eric W. Biederman) wrote:
> > 
> > > Vivek Goyal  writes:
> > 
> > > >> There is also a huge missing piece of this in that your purgatory is not
> > > >> checking a hash of the loaded image before jumping to it.  Without that
> > > >> this is a huge regression at least for the kexec on panic case.  We
> > > >> absolutely need to check that the kernel sitting around in memory has
> > > >> not been corrupted before we let it run very far.
> > > >
> > > > Agreed. This should not be hard. It is just a matter of calculating
> > > > digest of segments. I will store it in kimage and verify digest again
> > > > before passing control to control page. Will fix it in next version.
> > > 
> > > Nak.  The verification needs to happen in purgatory. 
> > > 
> > > The verification needs to happen in code whose runtime environment
> > > does not depend on random parts of the kernel.  Anything else is a
> > > regression in maintainability and reliability.
> > 
> > Hello Vivek,
> > 
> > Just to be sure that you have not forgotten the following s390 detail:
> > 
> > On s390 we first call purgatory with parameter "0" for doing the
> > checksum test. If this fails, we can have as backup solution our
> > traditional stand-alone dump. In case the checksum test was ok,
> > we call purgatory a second time with parameter "1" which then
> > starts kdump.
> > 
> > Could you please ensure that this mechanism also works after
> > your rework.
> 
> Hi Michael,
> 
> All that logic is in arch dependent portion of s390? If yes, I am not
> touching any arch dependent part of s390 yet and only doing implementation
> of x86.

Yes, part of s390 architecture code (kernel and kexec purgatory).

kernel:
---
arch/s390/kernel/machine_kexec.c:
 kdump_csum_valid() -> rc = start_kdump(0);
 __do_machine_kdump() -> start_kdump(1)

kexec tools:

purgatory/arch/s390/setup-s390.S
  cghi %r2,0
  je verify_checksums

> Generic changes should be usable by s390 and you should be able to do
> same thing there. Though we are still debating whether segment checksum
> verification logic should be part of purgatory or core kernel.

Yes, that was my concern. If you move the purgatory checksum logic to
the kernel we probably have to consider our s390 checksum test.

Thanks!
Michael



Re: [PATCH 0/6] kexec: A new system call to allow in kernel loading

2013-11-25 Thread Michael Holzheu
On Fri, 22 Nov 2013 05:34:03 -0800
ebied...@xmission.com (Eric W. Biederman) wrote:

> Vivek Goyal  writes:

> >> There is also a huge missing piece of this in that your purgatory is not
> >> checking a hash of the loaded image before jumping to it.  Without that
> >> this is a huge regression at least for the kexec on panic case.  We
> >> absolutely need to check that the kernel sitting around in memory has
> >> not been corrupted before we let it run very far.
> >
> > Agreed. This should not be hard. It is just a matter of calculating
> > digest of segments. I will store it in kimage and verify digest again
> > before passing control to control page. Will fix it in next version.
> 
> Nak.  The verification needs to happen in purgatory. 
> 
> The verification needs to happen in code whose runtime environment
> does not depend on random parts of the kernel.  Anything else is a
> regression in maintainability and reliability.

Hello Vivek,

Just to be sure that you have not forgotten the following s390 detail:

On s390 we first call purgatory with parameter "0" for doing the
checksum test. If this fails, we can have as backup solution our
traditional stand-alone dump. In case the checksum test was ok,
we call purgatory a second time with parameter "1" which then
starts kdump.

Could you please ensure that this mechanism also works after
your rework.

Best Regards,
Michael



Re: [PATCH 0/3] procfs, vmcore: fix regression of mmap on /proc/vmcore since v3.12-rc1

2013-10-15 Thread Michael Holzheu
Hello Hatayama,

We successfully tested your patches on s390, mmap for /proc/vmcore
seems to work again.

Thanks!

Michael

On Mon, 14 Oct 2013 18:36:06 +0900
HATAYAMA Daisuke  wrote:

> This patch set fixes regression of mmap on /proc/vmcore since
> v3.12-rc1. The primary one is the 2nd patch. The 1st patch is another
> bug I found during investigation and it affects mmap on /proc/vmcore
> even if only 2nd patch is applied, so fix it together in this patch
> set. The last patch is just for cleaning up of target function in both
> 1st and 2nd patches.
> 
> ---
> 
> HATAYAMA Daisuke (3):
>   procfs: fix unintended truncation of returned mapped address
>   procfs: Call default get_unmapped_area on MMU-present architectures
>   procfs: clean up proc_reg_get_unmapped_area for 80-column limit
> 
> 
>  fs/proc/inode.c |   20 ++--
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 



mmap for /proc/vmcore broken since 3.12-rc1

2013-10-02 Thread Michael Holzheu
Hello Alexey,

Looks like the following commit broke mmap for /proc/vmcore:

commit c4fe24485729fc2cbff324c111e67a1cc2f9adea
Author: Alexey Dobriyan 
Date:   Tue Aug 20 22:17:24 2013 +0300

sparc: fix PCI device proc file mmap(2)

Because /proc/vmcore (fs/proc/vmcore.c) does not implement the
get_unmapped_area() fops function, mmap now always returns EIO.

Michael



[PATCH] mm: Fix bootmem error handling in pcpu_page_first_chunk()

2013-09-17 Thread Michael Holzheu
If memory allocation in pcpu_embed_first_chunk() fails, the
allocated memory is not released correctly. The release loop also
releases the non-allocated elements, which leads to the following
kernel BUG on systems with very little memory:

[0.00] kernel BUG at mm/bootmem.c:307!
[0.00] illegal operation: 0001 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[0.00] Modules linked in:
[0.00] CPU: 0 PID: 0 Comm: swapper Not tainted 3.10.0 #22
[0.00] task: 00a20ae0 ti: 00a08000 task.ti: 
00a08000
[0.00] Krnl PSW : 04018000 00abda7a (__free+0x116/0x154)
[0.00]R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:0 CC:0 PM:0 
EA:3
...
[0.00]  [<00abdce2>] mark_bootmem_node+0xde/0xf0
[0.00]  [<00abdd9c>] mark_bootmem+0xa8/0x118
[0.00]  [<00abcbba>] pcpu_embed_first_chunk+0xe7a/0xf0c
[0.00]  [<00abcc96>] setup_per_cpu_areas+0x4a/0x28c

To fix the problem, now only the allocated elements are released. This then
leads to the correct kernel panic:

[0.00] Kernel panic - not syncing: Failed to initialize percpu areas.
...
[0.00] Call Trace:
[0.00] ([<0011307e>] show_trace+0x132/0x150)
[0.00]  [<00113160>] show_stack+0xc4/0xd4
[0.00]  [<007127dc>] dump_stack+0x74/0xd8
[0.00]  [<007123fe>] panic+0xea/0x264
[0.00]  [<00b14814>] setup_per_cpu_areas+0x5c/0x28c

Signed-off-by: Michael Holzheu 
---
 mm/percpu.c |5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1705,9 +1705,12 @@ int __init pcpu_embed_first_chunk(size_t
goto out_free;
 
 out_free_areas:
-   for (group = 0; group < ai->nr_groups; group++)
+   for (group = 0; group < ai->nr_groups; group++) {
+   if (!areas[group])
+   continue;
free_fn(areas[group],
ai->groups[group].nr_units * ai->unit_size);
+   }
 out_free:
pcpu_free_alloc_info(ai);
if (areas)



[PATCH v8 0/6] kdump: Allow ELF header creation in new kernel

2013-07-23 Thread Michael Holzheu
Hello Andrew,

Here is the new kdump patch series as discussed with Vivek and Hatayama
during the last months.

I adjusted the code to v3.11-rc2 where the following patches have been
integrated:

https://git.kernel.org/cgit/linux/kernel/git/s390/linux.git/commit/?h=features&id=5a74953ff56aa870d6913ef4d81934f5c620c59d
https://git.kernel.org/cgit/linux/kernel/git/s390/linux.git/commit/?h=features&id=191a2fa0a8d2bbb64c98f9b1976fcb37ee5eae6b
 

Vivek has accepted the patch series. So would you integrate it into 3.12?

Best Regards,
Michael

ChangeLog
=
v7 => v8)

- Rebase to v3.11-rc2
- Add patch 5/6: vmcore: Enable /proc/vmcore mmap for s390

v6 => v7)

- Rebase to v3.11-rc1
- Return VM_FAULT_SIGBUS in fault handler for non s390
- Use __va() for buffer in fault handler

v5 => v6)

- Set elfcorehdr_addr to ELFCORE_ADDR_ERR after elfcorehdr_free()
- Fix OLDMEM_SIZE/ZFCPDUMP_HSA_SIZE confusion
- Remove return VM_FAULT_MAJOR/MINOR
- Use find_or_create_page() in mmap_vmcore_fault()
- Use kfree instead of vfree in elfcorehdr_free()

v4 => v5)

- Add weak function elfcorehdr_read_notes() to read ELF notes
- Rename weak functions for ELF header access and use "vmcorehdr_" prefix
- Generic vmcore code calls elfcorehdr_alloc() if elfcorehdr= is not specified
- Add vmcore fault handler for mmap of non-resident memory regions
- Add weak function remap_oldmem_pfn_range() to be used by zfcpdump for mmap

v3 => v4)

- Rebase to 3.10-rc2 + vmcore mmap patches v8

v2 => v3)

- Get rid of ELFCORE_ADDR_NEWMEM
- Make read_from_crash_header() only read from kernel
- Move read_from_crash_header() to weak function arch_read_from_crash_header()
- Implement read_from_crash_header() strong function for s390
- Set elfcorehdr_addr to address of new ELF header

v1 => v2)

- Rebase 3.10-rc2 + vmcore mmap patches
- Introduced arch_get/free_crash_header() and ELFCORE_ADDR_NEWMEM

Feature Description
===
For s390 we want to use /proc/vmcore for our SCSI stand-alone
dump (zfcpdump). We have support where the first HSA_SIZE bytes are
saved into a hypervisor owned memory area (HSA) before the kdump
kernel is booted. When the kdump kernel starts, it is restricted
to use only HSA_SIZE bytes.

The advantages of this mechanism are:

* No crashkernel memory has to be defined in the old kernel.
* Early boot problems (before kexec_load has been done) can be dumped 
* Non-Linux systems can be dumped.

We modify the s390 copy_oldmem_page() function to read from the HSA memory
if memory below HSA_SIZE bytes is requested.

Since we cannot use the kexec tool to load the kernel in this scenario,
we have to build the ELF header in the 2nd (kdump/new) kernel.

So with the following patch set we would like to introduce the new
function that the ELF header for /proc/vmcore can be created in the 2nd
kernel memory.

The following steps are done during zfcpdump execution:

1.  Production system crashes
2.  User boots a SCSI disk that has been prepared with the zfcpdump tool
3.  Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
4.  Boot loader loads kernel into low memory area
5.  Kernel boots and uses only HSA_SIZE bytes of memory
6.  Kernel saves registers of non-boot CPUs
7.  Kernel does memory detection for dump memory map
8.  Kernel creates ELF header for /proc/vmcore
9.  /proc/vmcore uses this header for initialization
10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
- copy_oldmem_page() copies from HSA for memory below HSA_SIZE
- copy_oldmem_page() copies from real memory for memory above HSA_SIZE

Jan Willeke (1):
  s390/vmcore: Implement remap_oldmem_pfn_range for s390

Michael Holzheu (5):
  vmcore: Introduce ELF header in new memory feature
  s390/vmcore: Use ELF header in new memory feature
  vmcore: Introduce remap_oldmem_pfn_range()
  vmcore: Enable /proc/vmcore mmap for s390
  s390/vmcore: Use vmcore for zfcpdump

 arch/s390/Kconfig |   3 +-
 arch/s390/include/asm/sclp.h  |   1 +
 arch/s390/kernel/crash_dump.c | 219 ++
 drivers/s390/char/zcore.c |   6 +-
 fs/proc/vmcore.c  | 154 +
 include/linux/crash_dump.h|   9 ++
 6 files changed, 329 insertions(+), 63 deletions(-)

-- 
1.8.2.3



[PATCH v8 1/6] vmcore: Introduce ELF header in new memory feature

2013-07-23 Thread Michael Holzheu
Currently for s390 we create the ELF core header in the 2nd kernel
with a small trick. We relocate the addresses in the ELF header in
a way that for the /proc/vmcore code it seems to be in the 1st kernel
(old) memory and the read_from_oldmem() returns the correct data.
This allows the /proc/vmcore code to use the ELF header in the
2nd kernel.

This patch now exchanges the old mechanism with the new and much
cleaner function call override feature that now officially allows
creating the ELF core header in the 2nd kernel.

To use the new feature the following functions have to be defined
by the architecture backend code to read from new memory:

 * elfcorehdr_alloc: Allocate ELF header
 * elfcorehdr_free: Free the memory of the ELF header
 * elfcorehdr_read: Read from ELF header
 * elfcorehdr_read_notes: Read from ELF notes

Signed-off-by: Michael Holzheu 
Acked-by: Vivek Goyal 
---
 fs/proc/vmcore.c   | 61 ++
 include/linux/crash_dump.h |  6 +
 2 files changed, 57 insertions(+), 10 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index a1a16eb..02cb3ff 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -123,6 +123,36 @@ static ssize_t read_from_oldmem(char *buf, size_t count,
return read;
 }
 
+/*
+ * Architectures may override this function to allocate ELF header in 2nd kernel
+ */
+int __weak elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size)
+{
+   return 0;
+}
+
+/*
+ * Architectures may override this function to free header
+ */
+void __weak elfcorehdr_free(unsigned long long addr)
+{}
+
+/*
+ * Architectures may override this function to read from ELF header
+ */
+ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos)
+{
+   return read_from_oldmem(buf, count, ppos, 0);
+}
+
+/*
+ * Architectures may override this function to read from notes sections
+ */
+ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
+{
+   return read_from_oldmem(buf, count, ppos, 0);
+}
+
 /* Read from the ELF header and then the crash dump. On error, negative value is
  * returned otherwise number of bytes read are returned.
  */
@@ -357,7 +387,7 @@ static int __init update_note_header_size_elf64(const Elf64_Ehdr *ehdr_ptr)
notes_section = kmalloc(max_sz, GFP_KERNEL);
if (!notes_section)
return -ENOMEM;
-	rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
+	rc = elfcorehdr_read_notes(notes_section, max_sz, &offset);
if (rc < 0) {
kfree(notes_section);
return rc;
@@ -444,7 +474,8 @@ static int __init copy_notes_elf64(const Elf64_Ehdr *ehdr_ptr, char *notes_buf)
if (phdr_ptr->p_type != PT_NOTE)
continue;
offset = phdr_ptr->p_offset;
-   rc = read_from_oldmem(notes_buf, phdr_ptr->p_memsz, , 0);
+   rc = elfcorehdr_read_notes(notes_buf, phdr_ptr->p_memsz,
+  );
if (rc < 0)
return rc;
notes_buf += phdr_ptr->p_memsz;
@@ -536,7 +567,7 @@ static int __init update_note_header_size_elf32(const Elf32_Ehdr *ehdr_ptr)
notes_section = kmalloc(max_sz, GFP_KERNEL);
if (!notes_section)
return -ENOMEM;
-   rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
+   rc = elfcorehdr_read_notes(notes_section, max_sz, &offset);
if (rc < 0) {
kfree(notes_section);
return rc;
@@ -623,7 +654,8 @@ static int __init copy_notes_elf32(const Elf32_Ehdr *ehdr_ptr, char *notes_buf)
if (phdr_ptr->p_type != PT_NOTE)
continue;
offset = phdr_ptr->p_offset;
-   rc = read_from_oldmem(notes_buf, phdr_ptr->p_memsz, &offset, 0);
+   rc = elfcorehdr_read_notes(notes_buf, phdr_ptr->p_memsz,
+  &offset);
if (rc < 0)
return rc;
notes_buf += phdr_ptr->p_memsz;
@@ -810,7 +842,7 @@ static int __init parse_crash_elf64_headers(void)
addr = elfcorehdr_addr;
 
/* Read Elf header */
-   rc = read_from_oldmem((char*)&ehdr, sizeof(Elf64_Ehdr), &addr, 0);
+   rc = elfcorehdr_read((char *)&ehdr, sizeof(Elf64_Ehdr), &addr);
if (rc < 0)
return rc;
 
@@ -837,7 +869,7 @@ static int __init parse_crash_elf64_headers(void)
if (!elfcorebuf)
return -ENOMEM;
addr = elfcorehdr_addr;
-   rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, &addr, 0);
+   rc = elfcorehdr_read(elfcorebuf, elfcorebuf_sz_orig, &addr);
if (rc < 0)
goto fail;
 
@@ -866,7 +898,7 @@ static int __init parse_crash_elf32_headers(void)
 

[PATCH v8 4/6] s390/vmcore: Implement remap_oldmem_pfn_range for s390

2013-07-23 Thread Michael Holzheu
From: Jan Willeke 

This patch introduces the s390 specific way to map pages from oldmem.
The memory area below OLDMEM_SIZE is mapped with offset OLDMEM_BASE.
The other old memory is mapped directly.

Signed-off-by: Jan Willeke 
Signed-off-by: Michael Holzheu 
---
 arch/s390/kernel/crash_dump.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index 0c9a897..3e77615 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -99,6 +99,32 @@ ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
 }
 
 /*
+ * Remap "oldmem"
+ *
+ * For the kdump reserved memory this functions performs a swap operation:
+ * [0 - OLDMEM_SIZE] is mapped to [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]
+ */
+int remap_oldmem_pfn_range(struct vm_area_struct *vma, unsigned long from,
+  unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+   unsigned long size_old;
+   int rc;
+
+   if (pfn < OLDMEM_SIZE >> PAGE_SHIFT) {
+   size_old = min(size, OLDMEM_SIZE - (pfn << PAGE_SHIFT));
+   rc = remap_pfn_range(vma, from,
+pfn + (OLDMEM_BASE >> PAGE_SHIFT),
+size_old, prot);
+   if (rc || size == size_old)
+   return rc;
+   size -= size_old;
+   from += size_old;
+   pfn += size_old >> PAGE_SHIFT;
+   }
+   return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
  * Copy memory from old kernel
  */
 int copy_from_oldmem(void *dest, void *src, size_t count)
-- 
1.8.2.3

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v8 3/6] vmcore: Introduce remap_oldmem_pfn_range()

2013-07-23 Thread Michael Holzheu
For zfcpdump we can't map the HSA storage because it is only available
via a read interface. Therefore, for the new vmcore mmap feature we have
to introduce a new mechanism to create mappings on demand.

This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped. For zfcpdump this is everything besides
the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range(),
a new generic vmcore fault handler mmap_vmcore_fault() is called.

This handler works as follows:

* Get already available or new page from page cache (find_or_create_page)
* Check if /proc/vmcore page is filled with data (PageUptodate)
* If yes:
  Return that page
* If no:
  Fill page using __vmcore_read(), set PageUptodate, and return page

Signed-off-by: Michael Holzheu 
Acked-by: Vivek Goyal 
cc: HATAYAMA Daisuke 
---
 fs/proc/vmcore.c   | 91 ++
 include/linux/crash_dump.h |  3 ++
 2 files changed, 86 insertions(+), 8 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 02cb3ff..3f6cf0e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,7 @@
 #include <linux/crash_dump.h>
 #include <linux/list.h>
 #include <linux/vmalloc.h>
+#include <linux/pagemap.h>
 #include <asm/uaccess.h>
 #include <asm/io.h>
 #include "internal.h"
@@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
return read_from_oldmem(buf, count, ppos, 0);
 }
 
+/*
+ * Architectures may override this function to map oldmem
+ */
+int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+   return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
+ * Copy to either kernel or user space
+ */
+static int copy_to(void *target, void *src, size_t size, int userbuf)
+{
+   if (userbuf) {
+   if (copy_to_user(target, src, size))
+   return -EFAULT;
+   } else {
+   memcpy(target, src, size);
+   }
+   return 0;
+}
+
/* Read from the ELF header and then the crash dump. On error, negative value is
  * returned otherwise number of bytes read are returned.
  */
-static ssize_t read_vmcore(struct file *file, char __user *buffer,
-   size_t buflen, loff_t *fpos)
+static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
+int userbuf)
 {
ssize_t acc = 0, tmp;
size_t tsz;
@@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
/* Read ELF core header */
if (*fpos < elfcorebuf_sz) {
tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
-   if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
+   if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
 
tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
-   if (copy_to_user(buffer, kaddr, tsz))
+   if (copy_to(buffer, kaddr, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -208,7 +233,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
if (*fpos < m->offset + m->size) {
tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
start = m->paddr + *fpos - m->offset;
-   tmp = read_from_oldmem(buffer, tsz, &start, 1);
+   tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
if (tmp < 0)
return tmp;
buflen -= tsz;
@@ -225,6 +250,55 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
return acc;
 }
 
+static ssize_t read_vmcore(struct file *file, char __user *buffer,
+  size_t buflen, loff_t *fpos)
+{
+   return __read_vmcore(buffer, buflen, fpos, 1);
+}
+
+/*
+ * The vmcore fault handler uses the page cache and fills data using the
+ * standard __vmcore_read() function.
+ *
+ * On s390 the fault handler is used for memory regions that can't be mapped
+ * directly with remap_pfn_range().
+ */
+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+#ifdef CONFIG_S390
+   struct address_space *mapping = vma->vm_file->f_mapping;
+   pgoff_t index = vmf->pgoff;
+   struct page *page;
+   loff_t offset;
+   char *buf;
+   int rc;
+
+   page = find_or_create_page(

[PATCH v8 6/6] s390/vmcore: Use vmcore for zfcpdump

2013-07-23 Thread Michael Holzheu
This patch modifies the s390 copy_oldmem_page() and remap_oldmem_pfn_range()
functions for zfcpdump to read from the HSA memory if memory below HSA_SIZE
bytes is requested. Otherwise real memory is used.

Signed-off-by: Michael Holzheu 
---
 arch/s390/Kconfig |   3 +-
 arch/s390/include/asm/sclp.h  |   1 +
 arch/s390/kernel/crash_dump.c | 122 +++---
 drivers/s390/char/zcore.c |   6 +--
 4 files changed, 110 insertions(+), 22 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 22f75b5..f88bdac 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -514,6 +514,7 @@ config CRASH_DUMP
bool "kernel crash dumps"
depends on 64BIT && SMP
select KEXEC
+   select ZFCPDUMP
help
  Generate crash dump after being started by kexec.
  Crash dump kernels are loaded in the main kernel with kexec-tools
@@ -524,7 +525,7 @@ config CRASH_DUMP
 config ZFCPDUMP
def_bool n
prompt "zfcpdump support"
-   select SMP
+   depends on SMP
help
  Select this option if you want to build an zfcpdump enabled kernel.
  Refer to <file:Documentation/s390/zfcpdump.txt> for more details on this.
diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
index 06a1361..7dc7f9c 100644
--- a/arch/s390/include/asm/sclp.h
+++ b/arch/s390/include/asm/sclp.h
@@ -56,5 +56,6 @@ bool sclp_has_linemode(void);
 bool sclp_has_vt220(void);
 int sclp_pci_configure(u32 fid);
 int sclp_pci_deconfigure(u32 fid);
+int memcpy_hsa(void *dest, unsigned long src, size_t count, int mode);
 
 #endif /* _ASM_S390_SCLP_H */
diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index 3e77615..c84f33d 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define PTR_ADD(x, y) (((char *) (x)) + ((unsigned long) (y)))
 #define PTR_SUB(x, y) (((char *) (x)) - ((unsigned long) (y)))
@@ -69,22 +70,41 @@ static ssize_t copy_page_real(void *buf, void *src, size_t csize)
 static void *elfcorehdr_newmem;
 
 /*
- * Copy one page from "oldmem"
+ * Copy one page from zfcpdump "oldmem"
+ *
+ * For pages below ZFCPDUMP_HSA_SIZE memory from the HSA is copied. Otherwise
+ * real memory copy is used.
+ */
+static ssize_t copy_oldmem_page_zfcpdump(char *buf, size_t csize,
+unsigned long src, int userbuf)
+{
+   int rc;
+
+   if (src < ZFCPDUMP_HSA_SIZE) {
+   rc = memcpy_hsa(buf, src, csize, userbuf);
+   } else {
+   if (userbuf)
+   rc = copy_to_user_real((void __force __user *) buf,
+  (void *) src, csize);
+   else
+   rc = memcpy_real(buf, (void *) src, csize);
+   }
+   return rc ? rc : csize;
+}
+
+/*
+ * Copy one page from kdump "oldmem"
  *
  * For the kdump reserved memory this functions performs a swap operation:
  *  - [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE] is mapped to [0 - OLDMEM_SIZE].
  *  - [0 - OLDMEM_SIZE] is mapped to [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]
  */
-ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
-size_t csize, unsigned long offset, int userbuf)
+static ssize_t copy_oldmem_page_kdump(char *buf, size_t csize,
+ unsigned long src, int userbuf)
+
 {
-   unsigned long src;
int rc;
 
-   if (!csize)
-   return 0;
-
-   src = (pfn << PAGE_SHIFT) + offset;
if (src < OLDMEM_SIZE)
src += OLDMEM_BASE;
else if (src > OLDMEM_BASE &&
@@ -95,17 +115,35 @@ ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
   (void *) src, csize);
else
rc = copy_page_real(buf, (void *) src, csize);
-   return (rc == 0) ? csize : rc;
+   return (rc == 0) ? rc : csize;
 }
 
 /*
- * Remap "oldmem"
+ * Copy one page from "oldmem"
+ */
+ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
+unsigned long offset, int userbuf)
+{
+   unsigned long src;
+
+   if (!csize)
+   return 0;
+   src = (pfn << PAGE_SHIFT) + offset;
+   if (OLDMEM_BASE)
+   return copy_oldmem_page_kdump(buf, csize, src, userbuf);
+   else
+   return copy_oldmem_page_zfcpdump(buf, csize, src, userbuf);
+}
+
+/*
+ * Remap "oldmem" for kdump
  *
  * For the kdump reserved memory this functions performs a swap operation:
  * [0 - OLDMEM_SIZE] is mapped to [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]
  */
-int remap_oldmem_pfn_range(struct vm_area_struct *vma, unsigned long from,
-  unsigned long pfn, unsigned long size, pgprot_t prot)
static int remap_oldmem_pfn_range_kdump(struct vm_area_struct

[PATCH v8 5/6] vmcore: Enable /proc/vmcore mmap for s390

2013-07-23 Thread Michael Holzheu
The patch "s390/vmcore: Implement remap_oldmem_pfn_range for s390" now
allows mmap to be used on s390 as well.

So enable mmap for s390 again.

Signed-off-by: Michael Holzheu 
---
 fs/proc/vmcore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 3f6cf0e..532808e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -327,7 +327,7 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
  * regions in the 1st kernel pointed to by PT_LOAD entries) into
  * virtually contiguous user-space in ELF layout.
  */
-#if defined(CONFIG_MMU) && !defined(CONFIG_S390)
+#ifdef CONFIG_MMU
 static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
 {
size_t size = vma->vm_end - vma->vm_start;
-- 
1.8.2.3



[PATCH v8 2/6] s390/vmcore: Use ELF header in new memory feature

2013-07-23 Thread Michael Holzheu
This patch now exchanges the old relocate mechanism with the new
arch function call override mechanism that allows creating the ELF
core header in the 2nd kernel.

Signed-off-by: Michael Holzheu 
---
 arch/s390/kernel/crash_dump.c | 81 ---
 1 file changed, 54 insertions(+), 27 deletions(-)

diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index d8f3556..0c9a897 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -64,6 +64,11 @@ static ssize_t copy_page_real(void *buf, void *src, size_t csize)
 }
 
 /*
+ * Pointer to ELF header in new kernel
+ */
+static void *elfcorehdr_newmem;
+
+/*
  * Copy one page from "oldmem"
  *
  * For the kdump reserved memory this functions performs a swap operation:
@@ -368,14 +373,6 @@ static int get_mem_chunk_cnt(void)
 }
 
 /*
- * Relocate pointer in order to allow vmcore code access the data
- */
-static inline unsigned long relocate(unsigned long addr)
-{
-   return OLDMEM_BASE + addr;
-}
-
-/*
  * Initialize ELF loads (new kernel)
  */
 static int loads_init(Elf64_Phdr *phdr, u64 loads_offset)
@@ -426,7 +423,7 @@ static void *notes_init(Elf64_Phdr *phdr, void *ptr, u64 notes_offset)
ptr = nt_vmcoreinfo(ptr);
memset(phdr, 0, sizeof(*phdr));
phdr->p_type = PT_NOTE;
-   phdr->p_offset = relocate(notes_offset);
+   phdr->p_offset = notes_offset;
phdr->p_filesz = (unsigned long) PTR_SUB(ptr, ptr_start);
phdr->p_memsz = phdr->p_filesz;
return ptr;
@@ -435,7 +432,7 @@ static void *notes_init(Elf64_Phdr *phdr, void *ptr, u64 notes_offset)
 /*
  * Create ELF core header (new kernel)
  */
-static void s390_elf_corehdr_create(char **elfcorebuf, size_t *elfcorebuf_sz)
+int elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size)
 {
Elf64_Phdr *phdr_notes, *phdr_loads;
int mem_chunk_cnt;
@@ -443,6 +440,11 @@ static void s390_elf_corehdr_create(char **elfcorebuf, size_t *elfcorebuf_sz)
u32 alloc_size;
u64 hdr_off;
 
+   if (!OLDMEM_BASE)
+   return 0;
+   /* If elfcorehdr= has been passed via cmdline, we use that one */
+   if (elfcorehdr_addr != ELFCORE_ADDR_MAX)
+   return 0;
mem_chunk_cnt = get_mem_chunk_cnt();
 
alloc_size = 0x1000 + get_cpu_cnt() * 0x300 +
@@ -460,27 +462,52 @@ static void s390_elf_corehdr_create(char **elfcorebuf, size_t *elfcorebuf_sz)
ptr = notes_init(phdr_notes, ptr, ((unsigned long) hdr) + hdr_off);
/* Init loads */
hdr_off = PTR_DIFF(ptr, hdr);
-   loads_init(phdr_loads, ((unsigned long) hdr) + hdr_off);
-   *elfcorebuf_sz = hdr_off;
-   *elfcorebuf = (void *) relocate((unsigned long) hdr);
-   BUG_ON(*elfcorebuf_sz > alloc_size);
+   loads_init(phdr_loads, hdr_off);
+   *addr = (unsigned long long) hdr;
+   elfcorehdr_newmem = hdr;
+   *size = (unsigned long long) hdr_off;
+   BUG_ON(elfcorehdr_size > alloc_size);
+   return 0;
 }
 
 /*
- * Create kdump ELF core header in new kernel, if it has not been passed via
- * the "elfcorehdr" kernel parameter
+ * Free ELF core header (new kernel)
  */
-static int setup_kdump_elfcorehdr(void)
+void elfcorehdr_free(unsigned long long addr)
 {
-   size_t elfcorebuf_sz;
-   char *elfcorebuf;
-
-   if (!OLDMEM_BASE || is_kdump_kernel())
-   return -EINVAL;
-   s390_elf_corehdr_create(&elfcorebuf, &elfcorebuf_sz);
-   elfcorehdr_addr = (unsigned long long) elfcorebuf;
-   elfcorehdr_size = elfcorebuf_sz;
-   return 0;
+   if (!elfcorehdr_newmem)
+   return;
+   kfree((void *)(unsigned long)addr);
+}
+
+/*
+ * Read from ELF header
+ */
+ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
+{
+   void *src = (void *)(unsigned long)*ppos;
+
+   src = elfcorehdr_newmem ? src : src - OLDMEM_BASE;
+   memcpy(buf, src, count);
+   *ppos += count;
+   return count;
 }
 
-subsys_initcall(setup_kdump_elfcorehdr);
+/*
+ * Read from ELF notes data
+ */
+ssize_t elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
+{
+   void *src = (void *)(unsigned long)*ppos;
+   int rc;
+
+   if (elfcorehdr_newmem) {
+   memcpy(buf, src, count);
+   } else {
+   rc = copy_from_oldmem(buf, src, count);
+   if (rc)
+   return rc;
+   }
+   *ppos += count;
+   return count;
+}
-- 
1.8.2.3


