Re: [PATCH v2 2/3] powerpc/mm: allow memory hotplug into a memoryless node

2016-09-20 Thread Reza Arbab

On Mon, Sep 19, 2016 at 09:53:49PM +1000, Balbir Singh wrote:

I presume you've tested with CONFIG_NODES_SHIFT of 8 (255 nodes?)


Oh yes, definitely.

The large number of possible nodes does not come into play here.

--
Reza Arbab



Re: [PATCH v2 3/3] mm: enable CONFIG_MOVABLE_NODE on powerpc

2016-09-20 Thread Reza Arbab

On Mon, Sep 19, 2016 at 11:59:35AM +0530, Aneesh Kumar K.V wrote:

Movable node also does.
memblock_set_bottom_up(true);
What is the impact of that? Do we need changes equivalent to that? Also,
where are we marking the nodes which can be hotplugged, i.e. where do we
do memblock_mark_hotplug()?


These are related to the mechanism x86 uses to create movable nodes at 
boot. The SRAT is parsed to mark any hotplug memory. That marking is 
used later when initializing ZONE_MOVABLE for each node. [1]


The bottom-up allocation is due to a timing issue [2]. There is a window 
where kernel memory may be allocated before the SRAT is parsed. Any 
bottom-up allocations done during that time will likely be in the same 
(nonmovable) node as the kernel image.


On power, I don't think we have a heuristic equivalent to that SRAT 
memory hotplug info. So, we'll be limited to dynamically adding movable 
nodes after boot.


1. http://events.linuxfoundation.org/sites/events/files/lcjp13_chen.pdf
2. commit 79442ed189ac ("mm/memblock.c: introduce bottom-up allocation 
mode")


--
Reza Arbab



Re: [PATCH 4/4] drivers/pci/hotplug: Support surprise hotplug

2016-09-20 Thread Michael Ellerman
Gavin Shan  writes:

> This supports PCI surprise hotplug. The design is highlighted
> below:
>
>* The PCI slot's surprise hotplug capability is exposed through
>  the device node property "ibm,slot-surprise-pluggable", meaning
>  PCI surprise hotplug will be disabled if skiboot doesn't support
>  it yet.
>* The interrupt caused by a presence or link state change is raised
>  on a surprise hotplug event. One event is allocated and queued to
>  the PCI slot for the workqueue to pick up and process in a serialized
>  fashion. The code flow for surprise hotplug is the same as for
>  managed hotplug, except that the affected PEs are put into the frozen
>  state to avoid unexpected EEH error reporting in the surprise
>  hot-remove path.
>
> Signed-off-by: Gavin Shan 
> ---
>  arch/powerpc/include/asm/pnv-pci.h |   9 ++
>  drivers/pci/hotplug/pnv_php.c  | 219 
> +

Can you please resend this and Cc linux-pci and Bjorn, thanks.

cheers


Re: [1/3] powerpc/eeh: Null check uses of eeh_pe_bus_get

2016-09-20 Thread Russell Currey
On Wed, 2016-09-21 at 14:02 +1000, Michael Ellerman wrote:
> On Mon, 2016-12-09 at 04:17:22 UTC, Russell Currey wrote:
> > 
> > eeh_pe_bus_get() can return NULL if a PCI bus isn't found for a given PE.
> > Some callers don't check this, and can cause a null pointer dereference
> > under certain circumstances.
> > 
> > Fix this by checking NULL everywhere eeh_pe_bus_get() is called.
> > 
> > Cc: stable #3.10+
> 
> This looks like it's a fix for 8a6b1bc70dbb ("powerpc/eeh: EEH core to handle
> special event") ?
> 
> Which was merged in v3.11-rc1.
> 
> If so I'll add a fixes line pointing at that commit and update the stable tag
> to v3.11+.

Thanks.

Also, the other two patches in this series shouldn't go to stable, that was my
mistake.

> 
> cheers



Re: [1/3] powerpc/eeh: Null check uses of eeh_pe_bus_get

2016-09-20 Thread Michael Ellerman
On Mon, 2016-12-09 at 04:17:22 UTC, Russell Currey wrote:
> eeh_pe_bus_get() can return NULL if a PCI bus isn't found for a given PE.
> Some callers don't check this, and can cause a null pointer dereference
> under certain circumstances.
> 
> Fix this by checking NULL everywhere eeh_pe_bus_get() is called.
> 
> Cc: stable #3.10+

This looks like it's a fix for 8a6b1bc70dbb ("powerpc/eeh: EEH core to handle
special event") ?

Which was merged in v3.11-rc1.

If so I'll add a fixes line pointing at that commit and update the stable tag to
v3.11+.

cheers


[PATCH] powerpc/64: abstract EE/RI handling between E/S cases

2016-09-20 Thread Nicholas Piggin
Signed-off-by: Nicholas Piggin 
---

This moves some of the ifdefs out of line and hopefully makes things
slightly more readable. Should be no generated code changes.

 arch/powerpc/kernel/entry_64.S | 111 +++--
 1 file changed, 62 insertions(+), 49 deletions(-)

diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 585b9ca..0a1e33b 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -40,6 +40,54 @@
 #include 
 
 /*
+ * Interrupt disabling abstraction between BOOK3E and BOOK3S.
+ * - SET/CLEAR_EE must only be used with RI set.
+ * - SET/CLEAR_RI must only be used with EE clear.
+ * - SET/CLEAR_EE_RI must only be used when both bits get modified.
+ */
+#ifdef CONFIG_PPC_BOOK3E
+
+/* BOOK3E does not have RI handling, so that gets compiled away. */
+#define SET_EE(reg)        wrteei  1
+#define CLEAR_EE(reg)      wrteei  0
+#define SET_RI(reg)
+#define CLEAR_RI(reg)
+#define SET_EE_RI(reg)     SET_EE(reg)
+#define CLEAR_EE_RI(reg)   CLEAR_EE(reg)
+
+#else /* CONFIG_PPC_BOOK3E */
+
+/*
+ * BOOK3S mtmsrd L=1 only modifies EE and RI bits of MSR. A single ori
+ * instruction could be used to set both bits in the register without changing
+ * other bits (which do not affect mtmsrd L=1), however it's likely that
+ * instruction scheduling would consider this a dependency.
+ */
+#define SET_EE_RI(reg) \
+   li  reg,MSR_RI; \
+   ori reg,reg,MSR_EE; \
+   mtmsrd  reg,1
+
+#define CLEAR_EE_RI(reg)   \
+   li  reg,0;  \
+   mtmsrd  reg,1
+
+#define CLEAR_EE(reg)  \
+   li  reg,MSR_RI; \
+   mtmsrd  reg,1
+
+#define SET_EE(reg)        SET_EE_RI(reg)
+
+#define CLEAR_RI(reg)  CLEAR_EE_RI(reg)
+
+#define SET_RI(reg)\
+   li  reg,MSR_RI; \
+   mtmsrd  reg,1
+
+#endif /* CONFIG_PPC_BOOK3E */
+
+
+/*
  * System calls.
  */
.section".toc","aw"
@@ -136,13 +184,7 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
EMIT_BUG_ENTRY 1b,__FILE__,__LINE__,BUGFLAG_WARNING
 #endif
 
-#ifdef CONFIG_PPC_BOOK3E
-   wrteei  1
-#else
-   li  r11,MSR_RI
-   ori r11,r11,MSR_EE
-   mtmsrd  r11,1
-#endif /* CONFIG_PPC_BOOK3E */
+   SET_EE_RI(r11)
 
/* We do need to set SOFTE in the stack frame or the return
 * from interrupt will be painful
@@ -191,20 +233,15 @@ system_call:  /* label this so stack 
traces look sane */
/*
 * Disable interrupts so current_thread_info()->flags can't change,
 * and so that we don't get interrupted after loading SRR0/1.
-*/
-#ifdef CONFIG_PPC_BOOK3E
-   wrteei  0
-#else
-   /*
+*
+* BOOK3S:
 * For performance reasons we clear RI the same time that we
 * clear EE. We only need to clear RI just before we restore r13
 * below, but batching it with EE saves us one expensive mtmsrd call.
 * We have to be careful to restore RI if we branch anywhere from
 * here (eg syscall_exit_work).
 */
-   li  r11,0
-   mtmsrd  r11,1
-#endif /* CONFIG_PPC_BOOK3E */
+   CLEAR_EE_RI(r11)
 
ld  r9,TI_FLAGS(r12)
li  r11,-MAX_ERRNO
@@ -218,15 +255,9 @@ system_call:   /* label this so stack 
traces look sane */
bne 3f
 #endif
 2: addir3,r1,STACK_FRAME_OVERHEAD
-#ifdef CONFIG_PPC_BOOK3S
-   li  r10,MSR_RI
-   mtmsrd  r10,1   /* Restore RI */
-#endif
+   SET_RI(r10) /* See above RI optimisation */
bl  restore_math
-#ifdef CONFIG_PPC_BOOK3S
-   li  r11,0
-   mtmsrd  r11,1
-#endif
+   CLEAR_RI(r11)
ld  r8,_MSR(r1)
ld  r3,RESULT(r1)
li  r11,-MAX_ERRNO
@@ -304,13 +335,11 @@ syscall_enosys:
b   .Lsyscall_exit

 syscall_exit_work:
-#ifdef CONFIG_PPC_BOOK3S
-   li  r10,MSR_RI
-   mtmsrd  r10,1   /* Restore RI */
-#endif
-   /* If TIF_RESTOREALL is set, don't scribble on either r3 or ccr.
-If TIF_NOERROR is set, just save r3 as it is. */
-
+   SET_RI(r10) /* See above RI optimisation */
+   /*
+* If TIF_RESTOREALL is set, don't scribble on either r3 or ccr.
+* If TIF_NOERROR is set, just save r3 as it is.
+*/
andi.   r0,r9,_TIF_RESTOREALL
beq+0f
REST_NVGPRS(r1)
@@ -349,13 +378,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
beq ret_from_except_lite
 
/* Re-enable interrupts */
-#ifdef CONFIG_PPC_BOOK3E
-   wrteei  1
-#else
-   li  r10,MSR_RI
-   ori r10,r10,MSR_EE
-   mtmsrd  r10,1
-#endif /* CONFIG_PPC_BOOK3E */
+   SET_EE_RI(r10)
 
bl  save_nvgprs
addir3,r1,STACK_FRAME_OVERHEAD
@@ -614,12 +637,7 @@ _GLOBAL(ret_from_except_lite)
 * can't change between when we 

RE: [PATCH v5 2/2] QE: remove PPCisms for QE

2016-09-20 Thread Qiang Zhao
On Mon, Sep 20, 2016 at 4:13 AM, Leo Li wrote:
> -Original Message-
> From: Leo Li [mailto:pku@gmail.com]
> Sent: Tuesday, September 20, 2016 4:13 AM
> To: Qiang Zhao 
> Cc: Scott Wood ; linuxppc-dev  d...@lists.ozlabs.org>; lkml ; X.B. Xie
> 
> Subject: Re: [PATCH v5 2/2] QE: remove PPCisms for QE
> 
> On Mon, Jul 25, 2016 at 12:43 AM, Zhao Qiang  wrote:
> > QE was supported on PowerPC and dependent on PPC. Now it is supported
> > on other platforms, so remove the PPCisms.
> >
> > Signed-off-by: Zhao Qiang 
> > ---
> > Changes for v2:
> > - na
> > Changes for v3:
> > - add NO_IRQ
> > Changes for v4:
> > - modify spin_event_timeout to opencoded timeout loop
> > - remove NO_IRQ
> > - modify virq_to_hw to opencoded code
> > Changes for v5:
> > - modify commit msg
> > - modify depends of QUICC_ENGINE
> > - add kerneldoc header for qe_issue_cmd
> >
> >  drivers/irqchip/qe_ic.c   | 28 +--
> >  drivers/soc/fsl/qe/Kconfig|  2 +-
> >  drivers/soc/fsl/qe/qe.c   | 80 ++--
> ---
> >  drivers/soc/fsl/qe/qe_io.c| 42 ++-
> >  drivers/soc/fsl/qe/qe_tdm.c   |  8 ++---
> >  drivers/soc/fsl/qe/ucc.c  | 10 +++---
> >  drivers/soc/fsl/qe/ucc_fast.c | 68 ++--
> >  include/soc/fsl/qe/qe.h   |  1 -
> >  include/soc/fsl/qe/qe_ic.h| 12 +++
> >  9 files changed, 133 insertions(+), 118 deletions(-)
> >
> 
> [snip]
> 
> > diff --git a/drivers/soc/fsl/qe/Kconfig b/drivers/soc/fsl/qe/Kconfig
> > index 73a2e08..b26b643 100644
> > --- a/drivers/soc/fsl/qe/Kconfig
> > +++ b/drivers/soc/fsl/qe/Kconfig
> > @@ -4,7 +4,7 @@
> >
> >  config QUICC_ENGINE
> > bool "Freescale QUICC Engine (QE) Support"
> > -   depends on FSL_SOC && PPC32
> > +   depends on OF && HAS_IOMEM
> > select GENERIC_ALLOCATOR
> > select CRC32
> > help
> 
> You make it possible to build QE drivers on ARM, but the UCC_GETH fails to
> build on arm64.  Please make sure all these drivers can build on other
> architectures.  Or you can simply make them only build for Power architecture
> as most of them are not available on ARM.
> 

Most of them are not available on ARM and ARM64.
Now, only qe-hdlc is available on ARM64.

BR
-Zhao Qiang


linux-next: manual merge of the powerpc tree with the kbuild tree

2016-09-20 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the powerpc tree got conflicts in:

  arch/powerpc/kernel/misc_32.S
  arch/powerpc/kernel/misc_64.S

between commit:

  9445aa1a3062 ("ppc: move exports to definitions")

from the kbuild tree and commit:

  6f698df10cb2 ("powerpc/kernel: Use kprobe blacklist for asm functions")

from the powerpc tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc arch/powerpc/kernel/misc_32.S
index f5156105c5f3,03756ffdcd71..
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@@ -360,7 -358,8 +360,9 @@@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_
sync/* additional sync needed on g4 */
isync
blr
+ _ASM_NOKPROBE_SYMBOL(flush_icache_range)
 +EXPORT_SYMBOL(flush_icache_range)
+ 
  /*
   * Flush a particular page from the data cache to RAM.
   * Note: this is necessary because the instruction cache does *not*
diff --cc arch/powerpc/kernel/misc_64.S
index 8b526846e72a,5d7e583f1588..
--- a/arch/powerpc/kernel/misc_64.S
+++ b/arch/powerpc/kernel/misc_64.S
@@@ -110,8 -109,8 +110,9 @@@ END_FTR_SECTION_IFSET(CPU_FTR_COHERENT_
bdnz2b
isync
blr
-   .previous .text
+ _ASM_NOKPROBE_SYMBOL(flush_icache_range)
 +EXPORT_SYMBOL(flush_icache_range)
+ 
  /*
   * Like above, but only do the D-cache.
   *


[powerpc:test 45/76] undefined reference to `.elf64_apply_relocate_add'

2016-09-20 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git test
head:   f616ddd0503c863e05be4a9d16b3c3ef8f85213f
commit: 90cf23bbbc9ef5d9da798c1b0caceb0ebc583b5f [45/76] powerpc: Factor out 
relocation code from module_64.c to elf_util_64.c.
config: powerpc-ppc64e_defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
git checkout 90cf23bbbc9ef5d9da798c1b0caceb0ebc583b5f
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All errors (new ones prefixed by >>):

   arch/powerpc/kernel/built-in.o: In function `.apply_relocate_add':
>> (.text+0x18828): undefined reference to `.elf64_apply_relocate_add'

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATHC v2 0/9] ima: carry the measurement list across kexec

2016-09-20 Thread Eric W. Biederman

Thiago Jung Bauermann  writes:

> Am Samstag, 17 September 2016, 00:17:37 schrieb Eric W. Biederman:
>> Thiago Jung Bauermann  writes:
>> > Hello Eric,
>> > 
>> > Am Freitag, 16 September 2016, 14:47:13 schrieb Eric W. Biederman:
>> >> I can see tracking to see if the list has changed at some
>> >> point and causing a reboot(LINUX_REBOOT_CMD_KEXEC) to fail.
>> > 
>> > Yes, that is an interesting feature that I can add using the checksum-
>> > verifying part of my code. I can submit a patch for that if there's
>> > interest, adding a reboot notifier that verifies the checksum and causes
>> > a regular reboot instead of a kexec reboot if the checksum fails.
>> 
>> I was thinking an early failure instead of getting all of the way down
>> into a kernel and discovering the tpm/ima subsystem would not be
>> initialized.  But where that falls in the reboot pathway I don't expect
>> there is much value in it.
>
> I'm not sure I understand. What I described doesn't involve the tpm or ima. 
> I'm suggesting that if I take the parts of patch 4/5 in the kexec hand-over 
> buffer series that verify the image checksum, I can submit a separate patch 
> that checks the integrity of the kexec image early in kernel_kexec() and 
> reverts to a regular reboot if the check fails. This would be orthogonal to 
> ima carrying its measurement list across kexec.
>
> I think there is value in that, because if the kexec image is corrupted the 
> machine will just get stuck in the purgatory and (unless it's a platform 
> where the purgatory can print to the console) without even an error message 
> explaining what is going on. Whereas if we notice the corruption before 
> jumping into the purgatory we can switch to a regular reboot and the machine 
> will boot successfully.
>
> To have an early failure, when would the checksum verification be done?
> What I can think of is to have kexec_file_load accept a new flag 
> KEXEC_FILE_VERIFY_IMAGE, which userspace could use to request an integrity 
> check when it's about to start the reboot procedure. Then it can decide to 
> either reload the kernel or use a regular reboot if the image is corrupted.
>
> Is this what you had in mind?

Sort of.

I was just thinking that instead of having the boot path verify your ima
list matches what is in the tpm and halting the boot there, we could
test that on reboot.  Which would give a clean failure without the nasty
poking into a prepared image.  The downside is that we have already run
the shutdown scripts so it wouldn't be much cleaner, than triggering a
machine reboot from elsewhere.

But I don't think we should spend too much time on that.  It was a
passing thought.  We should focus on getting a non-poked ima buffer
cleanly into kexec and we can worry about the rest later.

>> >> At least the common bootloader cases that I know of using kexec are
>> >> very
>> >> minimal distributions that live in a ramdisk and as such it should be
>> >> very straight forward to measure what is needed at or before
>> >> sys_kexec_load.  But that was completely dismissed as unrealistic so I
>> >> don't have a clue what actual problem you are trying to solve.
>> > 
>> > We are interested in solving the problem in a general way because it
>> > will be useful to us in the future for the case of an arbitrary number
>> > of kexecs (and thus not only a bootloader but also multiple full-blown
>> > distros may be involved in the chain).
>> > 
>> > But you are right that for the use case for which we currently need this
>> > feature it's feasible to measure everything upfront. We can cross the
>> > other bridge when we get there.
>> 
>> Then let's start there.  Passing the measurement list is something that
>> should not be controversial.
>
> Great!
>
>> >> If there is anyway we can start small and not with this big scary
>> >> infrastructure change I would very much prefer it.
>> > 
>> > Sounds good. If we pre-measure everything then the following patches
>> > from my buffer hand-over series are enough:
>> > 
>> > [PATCH v5 2/5] kexec_file: Add buffer hand-over support for the next
>> > kernel [PATCH v5 3/5] powerpc: kexec_file: Add buffer hand-over support
>> > for the next kernel
>> > 
>> > Would you consider including those two?
>> > 
>> > And like I mentioned in the cover letter, patch 1/5 is an interesting
>> > improvement that is worth considering.
>> 
>> So from 10,000 feet I think that is correct.
>> 
>> I am not quite certain why a new mechanism is being invented.  We have
>> other information that is already passed (much of it architecture
>> specific) like the flattened device tree.  If you remove the need to
>> update the information can you just append this information to the
>> flattened device tree without a new special mechanism to pass the data?
>> 
>> I am just reluctant to invent a new mechanism when there is an existing
>> mechanism that looks like it should work without problems.
>
> Michael 

Re: [PATCH v6 4/7] perf annotate: Do not ignore call instruction with indirect target

2016-09-20 Thread Arnaldo Carvalho de Melo
Em Tue, Sep 20, 2016 at 08:05:03PM +0530, Ravi Bangoria escreveu:
> On Monday 19 September 2016 09:14 PM, Arnaldo Carvalho de Melo wrote:
> > Em Fri, Aug 19, 2016 at 06:29:35PM +0530, Ravi Bangoria escreveu:
> >> Do not ignore call instruction with indirect target when its already
> >> identified as a call. This is an extension of commit e8ea1561952b
> >> ("perf annotate: Use raw form for register indirect call instructions")
> >> to generalize annotation for all instructions with indirect calls.

> >> This is needed for certain powerpc call instructions that use address
> >> in a register (such as bctrl, btarl, ...).
> >>
> >> Apart from that, when kcore is used to disassemble function, all call
> >> instructions were ignored. This patch will fix it as a side effect by
> >> not ignoring them. For example,
> >>
> >> Before (with kcore):
> >>mov%r13,%rdi
> >>callq  0x811a7e70
> >>  ^ jmpq   64
> >>mov%gs:0x7ef41a6e(%rip),%al
> >>
> >> After (with kcore):
> >>mov%r13,%rdi
> >>  > callq  0x811a7e70
> >>  ^ jmpq   64
> >>mov%gs:0x7ef41a6e(%rip),%al
> > Ok, makes sense, but then now I have the -> and can't press enter to go
> > to that function, in fact for the case I'm using as a test, the
> > vsnprintf kernel function, I get:
> >
> >│ 56:   test   %al,%al
> >│ ↓ je 81
> >│   lea-0x38(%rbp),%rsi
> >│   mov%r15,%rdi
> >│ → callq  0x993e3230 
> >
> > That 0x993e3230 should've been resolved to:
> >
> > [root@jouet ~]# grep 993e3230 /proc/kallsyms 
> > 993e3230 t format_decode
> >
> > Trying to investigate why it doesn't...
> 
> I investigated this.
> 
> If this example is with kcore, then it's expected, because perf annotate does
> not inspect kallsyms when it can't find the symbol name in the disassembly itself.
> 
> For example, disassembly of  finish_task_switch,

Yeah, that was the case, up to yesterday, please take a look at my
perf/core branch, specifically this patch:

https://git.kernel.org/cgit/linux/kernel/git/acme/linux.git/commit/?id=4de86ebeb433aa59764a13ce851cba08d747e0e1

I'll do that for other instructions as well, like with 'mov
$0xabcdef,%rax' that will try to resolve that address to a variable.

Will try to continue processing your (and others) patches first tho :-)

- Arnaldo
 
> with kcore:
> 
> 810cf1b0:   mov$0x1,%esi
> 810cf1b5:   mov$0x4,%edi
> 810cf1ba:   callq  0x811aced0
> 810cf1bf:   andb   $0xfb,0x4c4(%rbx)
> 810cf1c6:   jmpq   0x810cf0e9
> 810cf1cb:   mov%rbx,%rsi
> 810cf1ce:   mov%r13,%rdi
> 810cf1d1:   callq  0x811a7e70
> 810cf1d6:   jmpq   0x810cf0e4
> 
> with debuginfo:
 
> 810cf1b0:   mov$0x1,%esi
> 810cf1b5:   mov$0x4,%edi
> 810cf1ba:   callq  811aced0 <___perf_sw_event>
> 810cf1bf:   andb   $0xfb,0x4c4(%rbx)
> 810cf1c6:   jmpq   810cf0e9 
> 810cf1cb:   mov%rbx,%rsi
> 810cf1ce:   mov%r13,%rdi
> 810cf1d1:   callq  811a7e70 <__perf_event_task_sched_in>
> 810cf1d6:   jmpq   810cf0e4 
> 
> call__parse tries to find the symbol inside angle brackets, which are not
> present in the case of kcore.
> 
> -Ravi
> 
> 
> > - Arnaldo
> >
> >> Suggested-by: Michael Ellerman 
> >> [Suggested about 'bctrl' instruction]
> >> Signed-off-by: Ravi Bangoria 
> >> ---
> >> Changes in v6:
> >>   - No change
> >>
> >>  tools/perf/util/annotate.c | 8 ++--
> >>  1 file changed, 2 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
> >> index ea07588..a05423b 100644
> >> --- a/tools/perf/util/annotate.c
> >> +++ b/tools/perf/util/annotate.c
> >> @@ -81,16 +81,12 @@ static int call__parse(struct ins_operands *ops, const 
> >> char *norm_arch)
> >>return ops->target.name == NULL ? -1 : 0;
> >>  
> >>  indirect_call:
> >> -  tok = strchr(endptr, '(');
> >> -  if (tok != NULL) {
> >> +  tok = strchr(endptr, '*');
> >> +  if (tok == NULL) {
> >>ops->target.addr = 0;
> >>return 0;
> >>}
> >>  
> >> -  tok = strchr(endptr, '*');
> >> -  if (tok == NULL)
> >> -  return -1;
> >> -
> >>ops->target.addr = strtoull(tok + 1, NULL, 16);
> >>return 0;
> >>  }
> >> -- 
> >> 2.5.5


Re: [PATCH v6 4/7] perf annotate: Do not ignore call instruction with indirect target

2016-09-20 Thread Ravi Bangoria
Hi Arnaldo,

On Monday 19 September 2016 09:14 PM, Arnaldo Carvalho de Melo wrote:
> Em Fri, Aug 19, 2016 at 06:29:35PM +0530, Ravi Bangoria escreveu:
>> Do not ignore call instruction with indirect target when its already
>> identified as a call. This is an extension of commit e8ea1561952b
>> ("perf annotate: Use raw form for register indirect call instructions")
>> to generalize annotation for all instructions with indirect calls.
>>
>> This is needed for certain powerpc call instructions that use address
>> in a register (such as bctrl, btarl, ...).
>>
>> Apart from that, when kcore is used to disassemble function, all call
>> instructions were ignored. This patch will fix it as a side effect by
>> not ignoring them. For example,
>>
>> Before (with kcore):
>>mov%r13,%rdi
>>callq  0x811a7e70
>>  ^ jmpq   64
>>mov%gs:0x7ef41a6e(%rip),%al
>>
>> After (with kcore):
>>mov%r13,%rdi
>>  > callq  0x811a7e70
>>  ^ jmpq   64
>>mov%gs:0x7ef41a6e(%rip),%al
> Ok, makes sense, but then now I have the -> and can't press enter to go
> to that function, in fact for the case I'm using as a test, the
> vsnprintf kernel function, I get:
>
>│ 56:   test   %al,%al
>│ ↓ je 81
>│   lea-0x38(%rbp),%rsi
>│   mov%r15,%rdi
>│ → callq  0x993e3230 
>
> That 0x993e3230 should've been resolved to:
>
> [root@jouet ~]# grep 993e3230 /proc/kallsyms 
> 993e3230 t format_decode
>
> Trying to investigate why it doesn't...

I investigated this.

If this example is with kcore, then it's expected, because perf annotate does
not inspect kallsyms when it can't find the symbol name in the disassembly itself.

For example, disassembly of  finish_task_switch,

with kcore:

810cf1b0:   mov$0x1,%esi
810cf1b5:   mov$0x4,%edi
810cf1ba:   callq  0x811aced0
810cf1bf:   andb   $0xfb,0x4c4(%rbx)
810cf1c6:   jmpq   0x810cf0e9
810cf1cb:   mov%rbx,%rsi
810cf1ce:   mov%r13,%rdi
810cf1d1:   callq  0x811a7e70
810cf1d6:   jmpq   0x810cf0e4

with debuginfo:

810cf1b0:   mov$0x1,%esi
810cf1b5:   mov$0x4,%edi
810cf1ba:   callq  811aced0 <___perf_sw_event>
810cf1bf:   andb   $0xfb,0x4c4(%rbx)
810cf1c6:   jmpq   810cf0e9 
810cf1cb:   mov%rbx,%rsi
810cf1ce:   mov%r13,%rdi
810cf1d1:   callq  811a7e70 <__perf_event_task_sched_in>
810cf1d6:   jmpq   810cf0e4 

call__parse tries to find the symbol inside angle brackets, which are not
present in the case of kcore.

-Ravi


> - Arnaldo
>
>> Suggested-by: Michael Ellerman 
>> [Suggested about 'bctrl' instruction]
>> Signed-off-by: Ravi Bangoria 
>> ---
>> Changes in v6:
>>   - No change
>>
>>  tools/perf/util/annotate.c | 8 ++--
>>  1 file changed, 2 insertions(+), 6 deletions(-)
>>
>> diff --git a/tools/perf/util/annotate.c b/tools/perf/util/annotate.c
>> index ea07588..a05423b 100644
>> --- a/tools/perf/util/annotate.c
>> +++ b/tools/perf/util/annotate.c
>> @@ -81,16 +81,12 @@ static int call__parse(struct ins_operands *ops, const 
>> char *norm_arch)
>>  return ops->target.name == NULL ? -1 : 0;
>>  
>>  indirect_call:
>> -tok = strchr(endptr, '(');
>> -if (tok != NULL) {
>> +tok = strchr(endptr, '*');
>> +if (tok == NULL) {
>>  ops->target.addr = 0;
>>  return 0;
>>  }
>>  
>> -tok = strchr(endptr, '*');
>> -if (tok == NULL)
>> -return -1;
>> -
>>  ops->target.addr = strtoull(tok + 1, NULL, 16);
>>  return 0;
>>  }
>> -- 
>> 2.5.5



Re: powerpc/mm: Update the FORCE_MAX_ZONEORDER range to enable hugetlb

2016-09-20 Thread Michael Ellerman
On Mon, 2016-19-09 at 17:31:33 UTC, "Aneesh Kumar K.V" wrote:
> For hugetlb to work with 4K page size, we need the MAX_ORDER to be more
> than 13. When switching from a 64K page size to 4K linux page size using
> make nconfig, we end up with a CONFIG_FORCE_MAX_ZONEORDER value of 9.
> This results in the 16M hugepage being considered a gigantic huge page,
> which in turn can result in failure to set up hugepages if gigantic
> hugepage support is not enabled.
> 
> This also results in kernel crash with 4K radix configuration. We
> hit the below BUG_ON on radix
> 
>  kernel BUG at mm/huge_memory.c:364!
>  Oops: Exception in kernel mode, sig: 5 [#1]
>  SMP NR_CPUS=2048 NUMA PowerNV
>  Modules linked in:
>  CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.8.0-rc1-6-gbae9cc6 #1
>  task: c000f1af8000 task.stack: c000f1aec000
>  NIP: c0c5fa0c LR: c0c5f9d8 CTR: c0c5f9a4
>  REGS: c000f1aef920 TRAP: 0700   Not tainted (4.8.0-rc1-6-gbae9cc6)
>  MSR: 900102029033   CR: 24000844  
> XER: 
>  CFAR: c0c5f9e0 SOFTE: 1
> .
>  NIP [c0c5fa0c] hugepage_init+0x68/0x238
>  LR [c0c5f9d8] hugepage_init+0x34/0x238
> 
> Fixes: a7ee539584acf ("powerpc/Kconfig: Update config option based on page 
> size")
> 
> Reported-by: Santhosh 
> Signed-off-by: Aneesh Kumar K.V 
> Acked-by: Balbir Singh 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/d5a1e42cb4be016a45a787953d

cheers
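
(For the record, the arithmetic behind the threshold: with a 4K base page, a
16M hugepage spans 16M / 4K = 4096 = 2^12 pages, i.e. it is an order-12
allocation, and hugetlb treats a page size as gigantic once its order reaches
MAX_ORDER. So CONFIG_FORCE_MAX_ZONEORDER, which sets MAX_ORDER, needs to be at
least 13 for 16M pages to remain ordinary hugepages; a value of 9 only covers
allocations up to order 8, i.e. 1M with 4K pages.)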


Re: powerpc/64: syscall ABI

2016-09-20 Thread Michael Ellerman
On Wed, 2016-14-09 at 03:21:47 UTC, Nicholas Piggin wrote:
> Add some documentation for the 64-bit syscall ABI, which doesn't seem
> to be documented elsewhere.
> 
> This attempts to document existing practice. The only small
> discrepancy is glibc clobbers not quite matching the kernel (e.g.,
> xer, some vsyscalls trash cr1 whereas glibc only clobbers cr0). These
> will be resolved after this document is merged.
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/7b8845a2a2ec4d9658225cf9c5

cheers


Re: powerpc/64: replay hypervisor maintenance priority

2016-09-20 Thread Michael Ellerman
On Wed, 2016-14-09 at 03:01:21 UTC, Nicholas Piggin wrote:
> The hmi is defined to be higher priority than other maskable
> interrupts, so replay it first, as a best-effort to replay according
> to hardware priorities.
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/e0e0d6b7390b61feb06350ef4a

cheers


Re: powerpc: Ensure .mem(init|exit).text are within _stext/_etext

2016-09-20 Thread Michael Ellerman
On Thu, 2016-15-09 at 05:11:59 UTC, Michael Ellerman wrote:
> In our linker script we open code the list of text sections, because we
> need to include the __ftr_alt sections, which are arch-specific.
> 
> This means we can't use TEXT_TEXT as defined in vmlinux.lds.h, and so we
> don't have the MEM_KEEP() logic for memory hotplug sections.
> 
> If we build the kernel with the gold linker, and with CONFIG_MEMORY_HOTPLUG=y,
> we see that functions marked __meminit can end up outside of the
> _stext/_etext range, and also outside of _sinittext/_einittext, eg:
> 
> c000 T _stext
> c09e A _etext
> c09e3f18 T hash__vmemmap_create_mapping
> c0ca T _sinittext
> c0d00844 T _einittext
> 
> This causes them to not be recognised as text by is_kernel_text(), and
> prevents them being patched by jump_label (and presumably ftrace/kprobes
> etc.).
> 
> Fix it by adding MEM_KEEP() directives, mirroring what TEXT_TEXT does.
> 
> This isn't a problem when CONFIG_MEMORY_HOTPLUG=n, because we use the
> standard INIT_TEXT_SECTION() and EXIT_TEXT macros from vmlinux.lds.h.
> 
> Signed-off-by: Michael Ellerman 
> Tested-by: Anton Blanchard 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/7de3b27bac47da9de08409df1d

cheers
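
For context, the check the commit message refers to is essentially a range
test against _stext/_etext. A simplified sketch (not the exact
kernel/extable.c code; _stext and _etext come from <asm/sections.h>):

	/* A symbol outside [_stext, _etext) is not recognised as core
	 * kernel text, so jump_label and similar patching will refuse to
	 * touch it even though it is a perfectly good function. */
	static int is_kernel_text_sketch(unsigned long addr)
	{
		return addr >= (unsigned long)_stext &&
		       addr <  (unsigned long)_etext;
	}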


Re: powerpc: Don't change the section in _GLOBAL()

2016-09-20 Thread Michael Ellerman
On Thu, 2016-15-09 at 00:40:20 UTC, Michael Ellerman wrote:
> Currently the _GLOBAL() macro unilaterally sets the assembler section to
> ".text" at the start of the macro. This is rude as the caller may be
> using a different section.
> 
> So let the caller decide which section to emit the code into. On big
> endian we do need to switch to the ".opd" section to emit the OPD, but
> do that with pushsection/popsection, thereby leaving the original
> section intact.
> 
> The only place I could find where this requires changes to the code is
> in misc_32.S, where we need to switch back to ".text" after
> flush_icache_range() which is in ".kprobes.text".
> 
> I verified that the order of all entries in System.map is unchanged
> after this patch. The actual addresses shift around slightly so you
> can't just diff the System.map.
> 
> Signed-off-by: Michael Ellerman 
> Reviewed-by: Nicholas Piggin 

Applied to powerpc next.

https://git.kernel.org/powerpc/c/bea2dfd47ef3f8612f1265

cheers


Re: powerpc: do not use kprobe section to exempt exception handlers

2016-09-20 Thread Michael Ellerman
On Thu, 2016-15-09 at 14:29:44 UTC, Nicholas Piggin wrote:
> Use the blacklist macros instead. This allows the linker to move
> exception handler functions close to callers and avoids trampolines in
> larger kernels.
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/03465f899bdac70d34f6ca447a

cheers


Re: powerpc/64s: optimise syscall entry for virtual, relocatable case

2016-09-20 Thread Michael Ellerman
On Thu, 2016-15-09 at 09:03:21 UTC, Nicholas Piggin wrote:
> The mflr r10 instruction was left over from saving lr when the code used
> lr to branch to system_call_entry from the exception handler. That was
> changed by 6a404806d to use the count register. The value is never used
> now, so mflr can be removed, and r10 can be used for storage rather than
> spilling to the SPR scratch register.
> 
> The scratch register spill causes a long pipeline stall due to the SPR
> read after write. This change brings getppid syscall cost from 406 to
> 376 cycles on POWER8. getppid for non-relocatable case is 371 cycles.
> 
> Signed-off-by: Nicholas Piggin 
> Acked-by: Balbir Singh 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/18e3f56b1cacb96017e2a66844

cheers


Re: MAINTAINERS: Update cxl maintainers

2016-09-20 Thread Michael Ellerman
On Fri, 2016-16-09 at 04:28:44 UTC, Michael Neuling wrote:
> Fred has taken over the cxl maintenance I was doing.  This updates the
> MAINTAINERS file to reflect this.
> 
> It also removes a duplicate entry in the files covered.
> 
> Signed-off-by: Michael Neuling 
> Reviewed-by: Andrew Donnellan 
> Reviewed-by: Andrew Donnellan 
> Acked-by: Frederic Barrat 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/9d82fd2fae925efdf546cc25af

cheers


Re: powerpc/pseries: fix memory leak in queue_hotplug_event() error path

2016-09-20 Thread Michael Ellerman
On Mon, 2016-19-09 at 06:41:32 UTC, Andrew Donnellan wrote:
> If we fail to allocate work, we don't end up using hp_errlog_copy. Free it
> in the error path.
> 
> Signed-off-by: Andrew Donnellan 
> Reviewed-by: Nathan Fontenot 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/90ce35145cb622b3cd0007d50e

cheers


Re: powerpc/nvram: Fix an incorrect partition merge

2016-09-20 Thread Michael Ellerman
On Thu, 2015-10-12 at 07:30:02 UTC, xinhui wrote:
> From: Pan Xinhui 
> 
> When we merge two contiguous partitions whose signatures are marked
> NVRAM_SIG_FREE, we need to update prev's length and checksum, then write it
> to nvram, not cur's. So let's fix this mistake now.
> 
> Also use memset instead of strncpy to set the partition's name. It's
> more readable if we want to fill up with duplicate chars.
> 
> Signed-off-by: Pan Xinhui 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/11b7e154b132232535befe51c5

cheers


Re: powerpc/nvram: Fix a memory leak in err path

2016-09-20 Thread Michael Ellerman
On Wed, 2015-09-12 at 10:00:53 UTC, xinhui wrote:
> If kmemdup fails, we need to kfree *buff* first, then return -ENOMEM.
> Otherwise there is a memory leak.
> 
> Signed-off-by: Pan Xinhui 
> Reviewed-by: Nathan Fontenot 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/0d0fecc5b5bfddb0e67bef985c

cheers


Re: powerpc/64s: exception optimise MSR handling

2016-09-20 Thread Michael Ellerman
On Thu, 2016-15-09 at 09:04:46 UTC, Nicholas Piggin wrote:
> mtmsrd with L=1 only affects MSR_EE and MSR_RI bits, and we always
> know what state those bits are in, so the kernel MSR does not need to be
> loaded when modifying them.
> 
> mtmsrd is often in the critical execution path, so avoiding a dependency
> on even an L1 load is noticeable. On a POWER8 this saves about 3 cycles
> from the syscall path, and possibly a few from other exception returns
> (not measured).
> 
> Signed-off-by: Nicholas Piggin 

Applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/49d09bf2a66f4b5a6eabefb0d4

cheers


Re: [PATCH][RFC] Implement arch primitives for busywait loops

2016-09-20 Thread Nicholas Piggin
On Tue, 20 Sep 2016 14:35:45 +0200
Christian Borntraeger  wrote:

> On 09/20/2016 02:27 PM, Nicholas Piggin wrote:
> > On Tue, 20 Sep 2016 13:19:30 +0200
> > Christian Borntraeger  wrote:
> >   
> >> On 09/16/2016 10:57 AM, Nicholas Piggin wrote:  
> >>> Implementing busy wait loops with cpu_relax() in callers poses
> >>> some difficulties for powerpc.
> >>>
> >>> First, we want to put our SMT thread into a low priority mode for the
> >>> duration of the loop, but then return to normal priority after exiting
> >>> the loop.  Depending on the CPU design, 'HMT_low() ; HMT_medium();' as
> >>> cpu_relax() does may have HMT_medium take effect before HMT_low made
> >>> any (or much) difference.
> >>>
> >>> Second, it can be beneficial for some implementations to spin on the
> >>> exit condition with a statically predicted-not-taken branch (i.e.,
> >>> always predict the loop will exit).
> >>>
> >>> This is a quick RFC with a couple of users converted to see what
> >>> people think. I don't use a C branch with hints, because we don't want
> >>> the compiler moving the loop body out of line, which makes it a bit
> >>> messy unfortunately. If there's a better way to do it, I'm all ears.
> >>>
> >>> I would not propose to switch all callers immediately, just some
> >>> core synchronisation primitives.
> >> Just a FYA,
> >>
> >> On s390 we have a private version of cpu_relax that yields the cpu
> >> time slice back to the hypervisor via a hypercall.  
> > 
> > The powerpc guest also wants to yield to hypervisor in some busywait
> > situations.
> >   
> >> As this turned out
> >> to be problematic in some cases there is also now a cpu_relax_lowlatency.
> >>
> >> Now, this seems still problematic as there are too many places still 
> >> using cpu_relax instead of cpu_relax_lowlatency. So my plan is to do 
> >> a change of that, make cpu_relax just be a barrier and add a new 
> >> cpu_relax_yield that gives up the time slice. (so that s390 cpu_relax
> >> is just like any other cpu_relax)
> >>
> >> As far as I can tell the only place where I want to change cpu_relax
> >> to cpu_relax_lowlatency after that change is the stop machine run 
> >> code, so I hope to have no conflicts with your changes.  
> > 
> > I don't think there should be any conflicts, but it would be good to
> > make sure busy wait primitives can be usable by s390. So I can add
> > _yield variants that can do the right thing for s390.  
> 
> I was distracted by "more important work" (TM) but I will put you on
> CC when ready.
> > 
> > I need to think more about virtualization, so I'm glad you commented.
> > Powerpc would like to be told when a busywait loop knows the CPU it is
> > waiting for. So perhaps also a _yield_to_cpu variant as well.  
> 
> Yes, we also have 2 hypercalls: one that yields somehow and one that yields
> to a specific CPU. The latter is strongly preferred.

Okay, sounds good. I'll send out some updated patches soon too, so I'll
cc you on those. It would be good to come up with some basic guidelines
for when to use each variant too.

Thanks,
Nick


Re: [PATCH][RFC] Implement arch primitives for busywait loops

2016-09-20 Thread Christian Borntraeger
On 09/20/2016 02:27 PM, Nicholas Piggin wrote:
> On Tue, 20 Sep 2016 13:19:30 +0200
> Christian Borntraeger  wrote:
> 
>> On 09/16/2016 10:57 AM, Nicholas Piggin wrote:
>>> Implementing busy wait loops with cpu_relax() in callers poses
>>> some difficulties for powerpc.
>>>
>>> First, we want to put our SMT thread into a low priority mode for the
>>> duration of the loop, but then return to normal priority after exiting
>>> the loop.  Depending on the CPU design, 'HMT_low() ; HMT_medium();' as
>>> cpu_relax() does may have HMT_medium take effect before HMT_low made
>>> any (or much) difference.
>>>
>>> Second, it can be beneficial for some implementations to spin on the
>>> exit condition with a statically predicted-not-taken branch (i.e.,
>>> always predict the loop will exit).
>>>
>>> This is a quick RFC with a couple of users converted to see what
>>> people think. I don't use a C branch with hints, because we don't want
>>> the compiler moving the loop body out of line, which makes it a bit
>>> messy unfortunately. If there's a better way to do it, I'm all ears.
>>>
>>> I would not propose to switch all callers immediately, just some
>>> core synchronisation primitives.  
>> Just a FYA,
>>
>> On s390 we have a private version of cpu_relax that yields the cpu
>> time slice back to the hypervisor via a hypercall.
> 
> The powerpc guest also wants to yield to hypervisor in some busywait
> situations.
> 
>> As this turned out
>> to be problematic in some cases there is also now a cpu_relax_lowlatency.
>>
>> Now, this seems still problematic as there are too many places still 
>> using cpu_relax instead of cpu_relax_lowlatency. So my plan is to do 
>> a change of that, make cpu_relax just be a barrier and add a new 
>> cpu_relax_yield that gives up the time slice. (so that s390 cpu_relax
>> is just like any other cpu_relax)
>>
>> As far as I can tell the only place where I want to change cpu_relax
>> to cpu_relax_lowlatency after that change is the stop machine run 
>> code, so I hope to have no conflicts with your changes.
> 
> I don't think there should be any conflicts, but it would be good to
> make sure busy wait primitives can be usable by s390. So I can add
> _yield variants that can do the right thing for s390.

I was distracted by "more important work" (TM) but I will put you on
CC when ready.
> 
> I need to think more about virtualization, so I'm glad you commented.
> Powerpc would like to be told when a busywait loop knows the CPU it is
> waiting for. So perhaps also a _yield_to_cpu variant as well.

Yes, we also have 2 hypercalls: one that yields somehow and one that yields
to a specific CPU. The latter is strongly preferred.
> 
> Something that will work with mutex_spin_on_owner and similar would be
> nice too. As far as I can tell, powerpc may want to yield to hypervisor
> when the owner's vcpu is scheduled off in that case too.
> 



Re: [PATCH][RFC] Implement arch primitives for busywait loops

2016-09-20 Thread Nicholas Piggin
On Tue, 20 Sep 2016 13:19:30 +0200
Christian Borntraeger  wrote:

> On 09/16/2016 10:57 AM, Nicholas Piggin wrote:
> > Implementing busy wait loops with cpu_relax() in callers poses
> > some difficulties for powerpc.
> > 
> > First, we want to put our SMT thread into a low priority mode for the
> > duration of the loop, but then return to normal priority after exiting
> > the loop.  Depending on the CPU design, 'HMT_low() ; HMT_medium();' as
> > cpu_relax() does may have HMT_medium take effect before HMT_low made
> > any (or much) difference.
> > 
> > Second, it can be beneficial for some implementations to spin on the
> > exit condition with a statically predicted-not-taken branch (i.e.,
> > always predict the loop will exit).
> > 
> > This is a quick RFC with a couple of users converted to see what
> > people think. I don't use a C branch with hints, because we don't want
> > the compiler moving the loop body out of line, which makes it a bit
> > messy unfortunately. If there's a better way to do it, I'm all ears.
> > 
> > I would not propose to switch all callers immediately, just some
> > core synchronisation primitives.  
> Just a FYA,
> 
> On s390 we have a private version of cpu_relax that yields the cpu
> time slice back to the hypervisor via a hypercall.

The powerpc guest also wants to yield to hypervisor in some busywait
situations.

> As this turned out
> to be problematic in some cases there is also now a cpu_relax_lowlatency.
> 
> Now, this seems still problematic as there are too many places still 
> using cpu_relax instead of cpu_relax_lowlatency. So my plan is to do 
> a change of that, make cpu_relax just be a barrier and add a new 
> cpu_relax_yield that gives up the time slice. (so that s390 cpu_relax
> is just like any other cpu_relax)
> 
> As far as I can tell the only place where I want to change cpu_relax
> to cpu_relax_lowlatency after that change is the stop machine run 
> code, so I hope to have no conflicts with your changes.

I don't think there should be any conflicts, but it would be good to
make sure busy wait primitives can be usable by s390. So I can add
_yield variants that can do the right thing for s390.

I need to think more about virtualization, so I'm glad you commented.
Powerpc would like to be told when a busywait loop knows the CPU it is
waiting for. So perhaps also a _yield_to_cpu variant as well.

Something that will work with mutex_spin_on_owner and similar would be
nice too. As far as I can tell, powerpc may want to yield to hypervisor
when the owner's vcpu is scheduled off in that case too.

Thanks,
Nick


Re: [PATCH][RFC] Implement arch primitives for busywait loops

2016-09-20 Thread Christian Borntraeger
On 09/16/2016 10:57 AM, Nicholas Piggin wrote:
> Implementing busy wait loops with cpu_relax() in callers poses
> some difficulties for powerpc.
> 
> First, we want to put our SMT thread into a low priority mode for the
> duration of the loop, but then return to normal priority after exiting
> the loop.  Depending on the CPU design, 'HMT_low() ; HMT_medium();' as
> cpu_relax() does may have HMT_medium take effect before HMT_low made
> any (or much) difference.
> 
> Second, it can be beneficial for some implementations to spin on the
> exit condition with a statically predicted-not-taken branch (i.e.,
> always predict the loop will exit).
> 
> This is a quick RFC with a couple of users converted to see what
> people think. I don't use a C branch with hints, because we don't want
> the compiler moving the loop body out of line, which makes it a bit
> messy unfortunately. If there's a better way to do it, I'm all ears.
> 
> I would not propose to switch all callers immediately, just some
> core synchronisation primitives.
Just a FYA,

On s390 we have a private version of cpu_relax that yields the cpu
time slice back to the hypervisor via a hypercall. As this turned out
to be problematic in some cases there is also now a cpu_relax_lowlatency.

Now, this seems still problematic as there are too many places still 
using cpu_relax instead of cpu_relax_lowlatency. So my plan is to do 
a change of that, make cpu_relax just be a barrier and add a new 
cpu_relax_yield that gives up the time slice. (so that s390 cpu_relax
is just like any other cpu_relax)

As far as I can tell the only place where I want to change cpu_relax
to cpu_relax_lowlatency after that change is the stop machine run 
code, so I hope to have no conflicts with your changes.
> 
> ---
>  arch/powerpc/include/asm/processor.h | 22 ++
>  include/asm-generic/barrier.h|  7 ++-
>  include/linux/bit_spinlock.h |  5 ++---
>  include/linux/cgroup.h   |  7 ++-
>  include/linux/seqlock.h  | 10 --
>  5 files changed, 32 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/processor.h 
> b/arch/powerpc/include/asm/processor.h
> index 68e3bf5..e10aee2 100644
> --- a/arch/powerpc/include/asm/processor.h
> +++ b/arch/powerpc/include/asm/processor.h
> @@ -402,6 +402,28 @@ static inline unsigned long __pack_fe01(unsigned int 
> fpmode)
> 
>  #ifdef CONFIG_PPC64
>  #define cpu_relax()  do { HMT_low(); HMT_medium(); barrier(); } while (0)
> +
> +#define spin_do  \
> +do { \
> + HMT_low();  \
> + __asm__ __volatile__ (  "1010:");
> +
> +#define spin_while(cond) \
> + barrier();  \
> + __asm__ __volatile__ (  "cmpdi  %0,0\n\t"   \
> + "beq-   1010b   \n\t"   \
> + : : "r" (cond));\
> + HMT_medium();   \
> +} while (0)
> +
> +#define spin_until(cond) \
> + barrier();  \
> + __asm__ __volatile__ (  "cmpdi  %0,0\n\t"   \
> + "bne-   1010b   \n\t"   \
> + : : "r" (cond));\
> + HMT_medium();   \
> +} while (0)
> +
>  #else
>  #define cpu_relax()  barrier()
>  #endif
> diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
> index fe297b5..4c53b3a 100644
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -235,12 +235,9 @@ do { 
> \
>  #define smp_cond_load_acquire(ptr, cond_expr) ({ \
>   typeof(ptr) __PTR = (ptr);  \
>   typeof(*ptr) VAL;   \
> - for (;;) {  \
> + spin_do {   \
>   VAL = READ_ONCE(*__PTR);\
> - if (cond_expr)  \
> - break;  \
> - cpu_relax();\
> - }   \
> + } spin_until (cond_expr);   \
>   smp_acquire__after_ctrl_dep();  \
>   VAL;\
>  })
> diff --git a/include/linux/bit_spinlock.h b/include/linux/bit_spinlock.h
> index 3b5bafc..695743c 100644
> --- a/include/linux/bit_spinlock.h
> +++ b/include/linux/bit_spinlock.h
> @@ -25,9 +25,8 @@ static inline void bit_spin_lock(int bitnum, unsigned long 
> *addr)
>  #if defined(CONFIG_SMP) || 
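
To make the proposed interface concrete, here is a rough caller-side sketch,
under the assumption that the spin_do/spin_until macros quoted in the patch
above are available; the caller and variable names are made up for
illustration and are not part of the posted patch:

	/* Hypothetical caller: 'flag' is a variable some other CPU will
	 * eventually set to non-zero. The open-coded equivalent would be:
	 *	while (!READ_ONCE(*flag))
	 *		cpu_relax();
	 */
	static void wait_for_flag(int *flag)
	{
		spin_do {
			/* nothing else to do; just re-evaluate the flag */
		} spin_until (READ_ONCE(*flag));
		/* SMT priority has been restored (HMT_medium) on exit,
		 * and the branch back into the loop was statically
		 * predicted not-taken, i.e. predicted to exit. */
	}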

Re: [PATCH 2/2] powernv:idle:Implement lite variant of power_enter_stop

2016-09-20 Thread Gautham R Shenoy
Hi Balbir,

On Tue, Sep 20, 2016 at 03:54:43PM +1000, Balbir Singh wrote:
> > diff --git a/arch/powerpc/platforms/powernv/idle.c 
> > b/arch/powerpc/platforms/powernv/idle.c
> > index 479c256..c3d3fed 100644
> > --- a/arch/powerpc/platforms/powernv/idle.c
> > +++ b/arch/powerpc/platforms/powernv/idle.c
> > @@ -244,8 +244,15 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 
> > 0600,
> >  static void power9_idle(void)
> >  {
> > /* Requesting stop state 0 */
> > -   power9_idle_stop(0);
> > +   power9_idle_stop(0, 0);
> >  }
> > +
> > +static void power9_idle_lite(void)
> > +{
> > +   /* Requesting stop state 0 with ESL=EC=0 */
> > +   power9_idle_stop(0, 1);
> > +}
> > +
> >  /*
> >   * First deep stop state. Used to figure out when to save/restore
> >   * hypervisor context.
> > @@ -414,8 +421,12 @@ static int __init pnv_init_idle_states(void)
> >  
> > if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
> > ppc_md.power_save = power7_idle;
> > -   else if (supported_cpuidle_states & OPAL_PM_STOP_INST_FAST)
> > -   ppc_md.power_save = power9_idle;
> > +   else if (supported_cpuidle_states & OPAL_PM_STOP_INST_FAST) {
> > +   if (supported_cpuidle_states & OPAL_PM_WAKEUP_AT_NEXT_INST)
> > +   ppc_md.power_save = power9_idle_lite;
> > +   else
> > +   ppc_md.power_save = power9_idle;
> > +   }
> 
> If I am reading this correctly we decide at boot time whether we support
> wakeup at next instruction and make that the default sleep state.
> I am a little surprised that these are exclusive. I was expecting
> power9_idle_lite to be one of the states to go into before
> power9_idle

At boot time, we initialize ppc_md.power_save to
power9_idle/power9_idle_lite which ends up being the default idle
function in the absence of the cpuidle subsystem. When cpuidle is
available, idle code will call cpuidle governors which will determine
the appropriate idle state that can be entered into. Each of these
idle states has an associated callback function. In case of the
idle-states without OPAL_PM_STOP_INST_FAST associated with them, the
callback is stop_loop() and when the flag is set, the callback
function is stop_lite_loop().  So when cpuidle is present, these
states are not exclusive.

Note that both power9_idle() and power9_idle_lite() call stop0. Just
that the former executes stop0 with ESL=EC=1 and latter with ESL=EC=0.

That said, you're right that in the absence of the cpuidle governor,
if the lite variant of stop is advertised by the firmware through the
device-tree, we end up picking the liter version of stop0 as the
default idle state. Do you suggest that we retain power9_idle which
calls stop0 with ESL=EC=1 ? 

> Balbir Singh.
> 
--
Thanks and Regards
gautham.
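
(Background on the terminology: ESL and EC are PSSCR fields, Enable State Loss
and Exit Criterion. With ESL=EC=0 the stop instruction behaves like a
lightweight nap: no architected state is lost and the thread resumes at the
instruction following the stop on the next interrupt, rather than coming back
through the system reset vector, which is why the firmware flag is named
OPAL_PM_WAKEUP_AT_NEXT_INST.)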



Re: [2/2] Detect instruction fetch denied and report

2016-09-20 Thread Balbir Singh


On 20/09/16 16:35, Michael Ellerman wrote:
> On Mon, 2016-22-08 at 01:56:57 UTC, Balbir Singh wrote:
>> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
>> index a4db22f..f162e77 100644
>> --- a/arch/powerpc/mm/fault.c
>> +++ b/arch/powerpc/mm/fault.c
>> @@ -404,6 +404,10 @@ good_area:
>>  (cpu_has_feature(CPU_FTR_NOEXECUTE) ||
>>   !(vma->vm_flags & (VM_READ | VM_WRITE
>>  goto bad_area;
>> +#ifdef CONFIG_PPC_RADIX_MMU
> 
> We shouldn't need the #ifdef, radix_enabled() will be false.
> 
>> +if (radix_enabled() && regs->msr & PPC_BIT(35))
>> +goto bad_area;
> 
> Is it really architected as radix only?
> 
> Personally I dislike PPC_BIT(), I'd rather you just used 0x1000. That way
> when I'm staring at a register dump I have some chance of spotting that mask.
> 
> Also brackets around the bitwise & would make me feel more comfortable.
> 
> cheers
> 


Will make the changes and submit

Balbir


Re: [v2] Fix __tlbiel in hash_native_64

2016-09-20 Thread Michael Ellerman
On Tue, 2016-13-09 at 06:40:07 UTC, Balbir Singh wrote:
> __tlbie and __tlbiel are out of sync. __tlbie does the right thing:
> it calls tlbie with "tlbie rb, L" if CPU_FTR_ARCH_206 (cpu feature) is clear
> and with "tlbie rb" otherwise. During the cleanup of __tlbiel I noticed
> that __tlbiel was setting bit 11 PPC_BIT(21) independent of the ISA
> version for non-4k (L) pages. This patch fixes that issue. It also changes
> the current PPC_TLBIEL to PPC_TLBIEL_5 and introduces a new PPC_TLBIEL similar
> to PPC_TLBIE.

This should give more description of what the problem is, ie. we're setting a
bit in the instruction which is reserved on some processors, before talking
about the solution or what tlbie does.
 
> The arguments to PPC_TLBIE have also been changed/switched in order
> to be consistent with the actual assembly usage for clearer reading
> of code.

The whole thing should be split in 3 patches:

One which fixes the buglet where we set L=1 on 2.06 and 2.07.
One which renames PPC_TLBIEL() to PPC_TLBIEL_5().
One which switches the order of the arguments to PPC_TLBIEL().

Also do we even need a macro for the two argument form? We've had uses of the
two argument tlbie since 2011, so I suspect the assembler will happily accept
it?

cheers


Re: [2/2] Detect instruction fetch denied and report

2016-09-20 Thread Michael Ellerman
On Mon, 2016-22-08 at 01:56:57 UTC, Balbir Singh wrote:
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index a4db22f..f162e77 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -404,6 +404,10 @@ good_area:
>   (cpu_has_feature(CPU_FTR_NOEXECUTE) ||
>!(vma->vm_flags & (VM_READ | VM_WRITE
>   goto bad_area;
> +#ifdef CONFIG_PPC_RADIX_MMU

We shouldn't need the #ifdef, radix_enabled() will be false.

> + if (radix_enabled() && regs->msr & PPC_BIT(35))
> + goto bad_area;

Is it really architected as radix only?

Personally I dislike PPC_BIT(), I'd rather you just used 0x1000. That way
when I'm staring at a register dump I have some chance of spotting that mask.

Also brackets around the bitwise & would make me feel more comfortable.

cheers
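
(For reference, PPC_BIT() uses the ISA's MSB-0 bit numbering, so PPC_BIT(35)
is 1UL << (63 - 35), i.e. the mask 0x10000000 that a literal would spell out.)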


Re: PowerPC agpmode issues

2016-09-20 Thread Michel Dänzer
On 20/09/16 12:43 PM, Herminio Hernandez, Jr. wrote:
> 
> Yes to both, however when I set radeon.agpmode=1 most of the time the
> kernel freezes when booting. When I do get past that I get these errors:

[...]

> [   11.415769] [drm:.radeon_agp_init [radeon]] *ERROR* Illegal AGP
> Mode: 1 (valid 4, 8), leaving at 8

As you can see, AGP 1x isn't supported by your setup, only 4x or 8x. How
about radeon.agpmode=4?


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer