Re: RFC: Reducing the number of non volatile GPRs in the ppc64 kernel
I agree with Segher. We already know we have opportunities to do a better job with shrink-wrapping (pushing this kind of useless activity down past early exits), so having examples of code to look at to improve this would be useful.

-- Bill

Bill Schmidt, Ph.D.
Linux on Power Toolchain
IBM Linux Technology Center
wschm...@us.ibm.com
(507) 319-6873

From: Segher Boessenkool seg...@kernel.crashing.org
To: Anton Blanchard an...@samba.org
Cc: linuxppc-dev@lists.ozlabs.org, Michael Gschwind/Watson/IBM@IBMUS, Alan Modra amo...@gmail.com, Bill Schmidt/Rochester/IBM@IBMUS, Ulrich Weigand ulrich.weig...@de.ibm.com, pau...@samba.org
Date: 08/05/2015 06:20 AM
Subject: Re: RFC: Reducing the number of non volatile GPRs in the ppc64 kernel

Hi Anton,

On Wed, Aug 05, 2015 at 02:03:00PM +1000, Anton Blanchard wrote:
> While looking at traces of kernel workloads, I noticed places where gcc used a large number of non volatiles. Some of these functions did very little work, and we spent most of our time saving the non volatiles to the stack and reading them back.

That is something that should be fixed in GCC -- do you have an example of such a function?

> It made me wonder if we have the right ratio of volatile to non volatile GPRs. Since the kernel is completely self contained, we could potentially change that ratio.
>
> Attached is a quick hack to gcc and the kernel to decrease the number of non volatile GPRs to 8. I'm not sure if this is a good idea (and if the volatile to non volatile ratio is right), but this gives us something to play with.

Instead of the GCC hack you can add a bunch of -fcall-used-r14 etc. options; does that not work for you?


Segher

_______________________________________________
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v2 2/4] powerpc/cputable: advertise DSCR support on P7/P7+
On Fri, 2013-05-03 at 17:48 -0700, Nishanth Aravamudan wrote:

Signed-off-by: Nishanth Aravamudan n...@us.ibm.com

diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index ae9f433..a792157 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -98,6 +98,7 @@ extern void __restore_cpu_e6500(void);
 			    PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | \
 			    PPC_FEATURE_TRUE_LE | \
 			    PPC_FEATURE_PSERIES_PERFMON_COMPAT)
+#define COMMON_USER2_POWER7	(PPC_FEATURE2_DSCR)
 #define COMMON_USER_POWER8	(COMMON_USER_PPC64 | PPC_FEATURE_ARCH_2_06 |\
 			    PPC_FEATURE_SMT | PPC_FEATURE_ICACHE_SNOOP | \
 			    PPC_FEATURE_TRUE_LE | \
@@ -428,6 +429,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
 		.cpu_name		= "POWER7 (architected)",
 		.cpu_features		= CPU_FTRS_POWER7,
 		.cpu_user_features	= COMMON_USER_POWER7,
+		.cpu_user_features2	= COMMON_USER2_POWER7,
 		.mmu_features		= MMU_FTRS_POWER7,
 		.icache_bsize		= 128,
 		.dcache_bsize		= 128,
@@ -458,6 +460,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
 		.cpu_name		= "POWER7 (raw)",
 		.cpu_features		= CPU_FTRS_POWER7,
 		.cpu_user_features	= COMMON_USER_POWER7,
+		.cpu_user_features2	= COMMON_USER2_POWER7,
 		.mmu_features		= MMU_FTRS_POWER7,
 		.icache_bsize		= 128,
 		.dcache_bsize		= 128,
@@ -475,6 +478,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
 		.cpu_name		= "POWER7+ (raw)",
 		.cpu_features		= CPU_FTRS_POWER7,
 		.cpu_user_features	= COMMON_USER_POWER7,
+		.cpu_user_features	= COMMON_USER2_POWER7,
		   ^ Oops here, I think.

Please consider applying this on top. Untested, but seems obvious. Fix a typo so that the COMMON_USER2_POWER7 bits are assigned to .cpu_user_features2 in the cpu specs table.
Signed-off-by: Will Schmidt will_schm...@vnet.ibm.com

diff --git a/arch/powerpc/kernel/cputable.c b/arch/powerpc/kernel/cputable.c
index c60bbec..51eecb5 100644
--- a/arch/powerpc/kernel/cputable.c
+++ b/arch/powerpc/kernel/cputable.c
@@ -482,7 +482,7 @@ static struct cpu_spec __initdata cpu_specs[] = {
 		.cpu_name		= "POWER7+ (raw)",
 		.cpu_features		= CPU_FTRS_POWER7,
 		.cpu_user_features	= COMMON_USER_POWER7,
-		.cpu_user_features	= COMMON_USER2_POWER7,
+		.cpu_user_features2	= COMMON_USER2_POWER7,
 		.mmu_features		= MMU_FTRS_POWER7,
 		.icache_bsize		= 128,
 		.dcache_bsize		= 128,
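The bug fixed above is easy to make because C silently accepts duplicate designated initializers: the last one wins and the earlier assignment is discarded, so `.cpu_user_features2` was quietly left zero (GCC's `-Woverride-init`, enabled by `-Wextra`, warns about it). A minimal userspace sketch of the effect, using a hypothetical `cpu_spec_demo` struct rather than the kernel's real `struct cpu_spec`:

```c
#include <assert.h>

/* Simplified stand-in for the kernel's struct cpu_spec. */
struct cpu_spec_demo {
	unsigned long cpu_user_features;
	unsigned long cpu_user_features2;
};

/* Placeholder values; the real masks come from cputable.c. */
#define COMMON_USER_POWER7	0x1UL
#define COMMON_USER2_POWER7	0x2UL	/* e.g. PPC_FEATURE2_DSCR */

static struct cpu_spec_demo make_buggy(void)
{
	/* The typo: both initializers name .cpu_user_features. */
	struct cpu_spec_demo s = {
		.cpu_user_features	= COMMON_USER_POWER7,
		.cpu_user_features	= COMMON_USER2_POWER7, /* overrides! */
	};
	return s;
}

static struct cpu_spec_demo make_fixed(void)
{
	struct cpu_spec_demo s = {
		.cpu_user_features	= COMMON_USER_POWER7,
		.cpu_user_features2	= COMMON_USER2_POWER7,
	};
	return s;
}
```

With the typo, the last initializer wins: `cpu_user_features` ends up holding the USER2 bits and `cpu_user_features2` stays zero, which is exactly why userspace never saw DSCR advertised.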
RE: NAND BBT corruption on MPC83xx
Mike:

It is not a permanent damage thing. A read disturb does no permanent damage to the chip, but if the read-disturb event involves more bits than your ECC code can correct, it can do permanent damage to the *DATA* you've stored in that block.

For this reason, a good flash management system at least occasionally reads through *ALL* of the in-use blocks in the device so that single-bit errors can be scrubbed out (read and successfully corrected) before an adjacent bit in the block also fails, which would eventually lead to a multi-bit error beyond the ECC's ability to correct.

As far as I know (and I'm sure the list will correct me if I'm wrong! ;-) ), neither UBI nor UBIFS nor any Linux layer provides this routine scrubbing; you have to code it up yourself, probably by accessing the device at the UBI (underlying block device/LEB) layer.

Atlant

-----Original Message-----
From: linux-mtd-boun...@lists.infradead.org [mailto:linux-mtd-boun...@lists.infradead.org] On Behalf Of Mike Hench
Sent: Saturday, June 18, 2011 13:55
To: Scott Wood; Matthew L. Creech
Cc: linuxppc-dev@lists.ozlabs.org; linux-...@lists.infradead.org
Subject: RE: NAND BBT corruption on MPC83xx

Scott Wood wrote:
> As for the corruption, could it be degradation from repeated reads of that one page?

Read Disturb. I did not know SLC did that. It just takes 10x as long as MLC, on the order of a million reads. Supposedly erasing the block fixes it. It is not a permanent damage thing.

I was seeing ~9 hours before failure with heavy writes. ~4 GByte/hour = 2M pages, total ~18 million reads before errors in that last block showed up.

Cool. Now we know. Thanks.
Mike Hench

______________________________________________
Linux MTD discussion mailing list
http://lists.infradead.org/mailman/listinfo/linux-mtd/
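Atlant's scrubbing suggestion can be sketched as a loop over all in-use blocks: read each one, and whenever the read needed ECC correction, rewrite the block so the error count goes back to zero before a second bit flips. The following is a userspace toy model of that policy (the block struct, error counts, and function names are all illustrative, not the MTD/UBI API):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a NAND block: 'bit_errors' accumulates read-disturb
 * damage; the ECC here can correct at most ECC_STRENGTH flipped bits. */
#define ECC_STRENGTH 1

struct nand_block {
	int in_use;
	int bit_errors;		/* accumulated flipped bits */
};

enum read_status { READ_OK, READ_CORRECTED, READ_UNCORRECTABLE };

static enum read_status read_block(const struct nand_block *b)
{
	if (b->bit_errors == 0)
		return READ_OK;
	if (b->bit_errors <= ECC_STRENGTH)
		return READ_CORRECTED;
	return READ_UNCORRECTABLE;	/* data permanently lost */
}

/* Scrub pass: rewrite (erase + program) any in-use block that needed
 * ECC correction, resetting its error count before another bit fails. */
static void scrub(struct nand_block *blocks, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!blocks[i].in_use)
			continue;
		if (read_block(&blocks[i]) == READ_CORRECTED)
			blocks[i].bit_errors = 0;	/* rewritten clean */
	}
}
```

The point of the model: a single-bit error caught by a scrub pass is harmless, while two accumulated errors exceed this 1-bit ECC and the data is gone — which is why the scrub has to visit *all* in-use blocks periodically, not just the ones the workload happens to read.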
[PATCH 1/1 v2 ] Add kernel parameter to disable batched hcalls
This introduces a pair of kernel parameters that can be used to disable the MULTITCE and BULK_REMOVE h-calls. By default, those hcalls are enabled, active, and good for throughput and performance. The ability to disable them will be useful for some of the PREEMPT_RT related investigation and work occurring on Power.

Signed-off-by: Will Schmidt will_schm...@vnet.ibm.com
cc: Olof Johansson o...@lixom.net
cc: Anton Blanchard an...@samba.org
cc: Benjamin Herrenschmidt b...@kernel.crashing.org
---
v2 - Per feedback from Olof, the code is reworked to utilize kernel
     parameter runtime checks, rather than CONFIG options.
   - Added relevant change to kernel-parameters.txt

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e2c7487..5c40801 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -426,6 +426,10 @@ and is between 256 and 4096 characters. It is defined in the file
 	bttv.pll=	See Documentation/video4linux/bttv/Insmod-options
 	bttv.tuner=	and Documentation/video4linux/bttv/CARDLIST
 
+	bulk_remove=off	[PPC] This parameter disables the use of the pSeries
+			firmware feature for flushing multiple hpte entries
+			at a time.
+
 	BusLogic=	[HW,SCSI]
 			See drivers/scsi/BusLogic.c, comment before function
 			BusLogic_ParseDriverOptions().
 
@@ -1499,6 +1503,10 @@ and is between 256 and 4096 characters. It is defined in the file
 	mtdparts=	[MTD]
 			See drivers/mtd/cmdlinepart.c.
 
+	multitce=off	[PPC] This parameter disables the use of the pSeries
+			firmware feature for updating multiple TCE entries
+			at a time.
+
 	onenand.bdry=	[HW,MTD] Flex-OneNAND Boundary Configuration
 			Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 902987d..e174a2f 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -625,3 +625,19 @@ void iommu_init_early_pSeries(void)
 	set_pci_dma_ops(&dma_iommu_ops);
 }
+
+static int __init disable_multitce(char *str)
+{
+	if (strcmp(str, "off") == 0) {
+		if (firmware_has_feature(FW_FEATURE_LPAR)) {
+			if (firmware_has_feature(FW_FEATURE_MULTITCE)) {
+				printk(KERN_INFO "Disabling MULTITCE firmware feature\n");
+				ppc_md.tce_build = tce_build_pSeriesLP;
+				ppc_md.tce_free = tce_free_pSeriesLP;
+				powerpc_firmware_features &= ~FW_FEATURE_MULTITCE;
+			}
+		}
+	}
+	return 1;
+}
+
+__setup("multitce=", disable_multitce);

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 0707653..82d15e7 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -599,6 +599,19 @@ static void pSeries_lpar_flush_hash_range(unsigned long number, int local)
 	spin_unlock_irqrestore(&pSeries_lpar_tlbie_lock, flags);
 }
+
+static int __init disable_bulk_remove(char *str)
+{
+	if (strcmp(str, "off") == 0) {
+		if (firmware_has_feature(FW_FEATURE_BULK_REMOVE)) {
+			printk(KERN_INFO "Disabling BULK_REMOVE firmware feature");
+			powerpc_firmware_features &= ~FW_FEATURE_BULK_REMOVE;
+		}
+	}
+	return 1;
+}
+
+__setup("bulk_remove=", disable_bulk_remove);
+
 void __init hpte_init_lpar(void)
 {
 	ppc_md.hpte_invalidate	= pSeries_lpar_hpte_invalidate;
Re: [PATCH 1/1 v3] Add kernel parameter to disable batched hcalls
This introduces a pair of kernel parameters that can be used to disable the MULTITCE and BULK_REMOVE h-calls. By default, those hcalls are enabled, active, and good for throughput and performance. The ability to disable them will be useful for some of the PREEMPT_RT related investigation and work occurring on Power.

Signed-off-by: Will Schmidt will_schm...@vnet.ibm.com
cc: Olof Johansson o...@lixom.net
cc: Anton Blanchard an...@samba.org
cc: Benjamin Herrenschmidt b...@kernel.crashing.org
---
v2 - Per feedback from Olof, the code is reworked to utilize kernel
     parameter runtime checks, rather than CONFIG options.
   - Added relevant change to kernel-parameters.txt
v3 - ran through checkpatch and fixed whitespace issues.
   - consolidated the if() staircases to save indentation.

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e2c7487..5c40801 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -426,6 +426,10 @@ and is between 256 and 4096 characters. It is defined in the file
 	bttv.pll=	See Documentation/video4linux/bttv/Insmod-options
 	bttv.tuner=	and Documentation/video4linux/bttv/CARDLIST
 
+	bulk_remove=off	[PPC] This parameter disables the use of the pSeries
+			firmware feature for flushing multiple hpte entries
+			at a time.
+
 	BusLogic=	[HW,SCSI]
 			See drivers/scsi/BusLogic.c, comment before function
 			BusLogic_ParseDriverOptions().
 
@@ -1499,6 +1503,10 @@ and is between 256 and 4096 characters. It is defined in the file
 	mtdparts=	[MTD]
 			See drivers/mtd/cmdlinepart.c.
 
+	multitce=off	[PPC] This parameter disables the use of the pSeries
+			firmware feature for updating multiple TCE entries
+			at a time.
+
 	onenand.bdry=	[HW,MTD] Flex-OneNAND Boundary Configuration
 			Format: [die0_boundary][,die0_lock][,die1_boundary][,die1_lock]

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 902987d..e174a2f 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -625,3 +625,17 @@ void iommu_init_early_pSeries(void)
 	set_pci_dma_ops(&dma_iommu_ops);
 }
+
+static int __init disable_multitce(char *str)
+{
+	if (strcmp(str, "off") == 0 &&
+	    firmware_has_feature(FW_FEATURE_LPAR) &&
+	    firmware_has_feature(FW_FEATURE_MULTITCE)) {
+		printk(KERN_INFO "Disabling MULTITCE firmware feature\n");
+		ppc_md.tce_build = tce_build_pSeriesLP;
+		ppc_md.tce_free = tce_free_pSeriesLP;
+		powerpc_firmware_features &= ~FW_FEATURE_MULTITCE;
+	}
+	return 1;
+}
+
+__setup("multitce=", disable_multitce);

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index 0707653..82d15e7 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -599,6 +599,18 @@ static void pSeries_lpar_flush_hash_range(unsigned long number, int local)
 	spin_unlock_irqrestore(&pSeries_lpar_tlbie_lock, flags);
 }
+
+static int __init disable_bulk_remove(char *str)
+{
+	if (strcmp(str, "off") == 0 &&
+	    firmware_has_feature(FW_FEATURE_BULK_REMOVE)) {
+		printk(KERN_INFO "Disabling BULK_REMOVE firmware feature");
+		powerpc_firmware_features &= ~FW_FEATURE_BULK_REMOVE;
+	}
+	return 1;
+}
+
+__setup("bulk_remove=", disable_bulk_remove);
+
 void __init hpte_init_lpar(void)
 {
 	ppc_md.hpte_invalidate	= pSeries_lpar_hpte_invalidate;
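The pattern in both `__setup()` handlers is the same: compare the boot argument against the exact string "off", and if the feature is currently advertised, clear its bit from the feature mask with `&= ~FLAG`. A self-contained userspace sketch of that logic (the feature bits and function names mimic the patch but are stand-ins, not the kernel's definitions):

```c
#include <assert.h>
#include <string.h>

/* Stand-ins for the kernel's firmware feature bits. */
#define FW_FEATURE_MULTITCE	(1UL << 0)
#define FW_FEATURE_BULK_REMOVE	(1UL << 1)

static unsigned long firmware_features =
	FW_FEATURE_MULTITCE | FW_FEATURE_BULK_REMOVE;

static int has_feature(unsigned long f)
{
	return (firmware_features & f) != 0;
}

/* Mirrors disable_multitce()/disable_bulk_remove(): only the exact
 * argument "off" disables the feature; anything else is ignored. */
static int parse_multitce(const char *str)
{
	if (strcmp(str, "off") == 0 && has_feature(FW_FEATURE_MULTITCE))
		firmware_features &= ~FW_FEATURE_MULTITCE;
	return 1;	/* __setup() handlers return 1 when the arg is consumed */
}
```

Note the deliberate asymmetry: there is no "multitce=on" path. The feature defaults to on when firmware advertises it, and the parameter can only take it away — which is all the PREEMPT_RT investigation needs.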
Re: [PATCH 1/1] Add config option for batched hcalls
On Sat, 2010-09-25 at 22:49 -0500, Olof Johansson wrote:
> On Fri, Sep 24, 2010 at 04:44:15PM -0500, Will Schmidt wrote:
> > Add a config option for the (batched) MULTITCE and BULK_REMOVE h-calls. By default, these options are on and are beneficial for performance and throughput reasons. If disabled, the code will fall back to using less optimal TCE and REMOVE hcalls. The ability to easily disable these options is useful for some of the PREEMPT_RT related investigation and work occurring on Power.
>
> Hi,
>
> I can see why it's useful to enable and disable, but these are all runtime-checked; wouldn't it be more useful to add a bootarg to handle it instead of adding some new config options that pretty much everyone will always go with the defaults on?
>
> The bits are set early, but from looking at where they're used, there doesn't seem to be any harm in disabling them later on when a bootarg is convenient to parse and deal with? It has the benefit of easier on/off testing, if that has any value for production debug down the road.

Hi Olof,

That's a good idea, let me poke at this a bit more, see if I can get bootargs for this.

Thanks,
-Will

> -Olof
[PATCH 1/1] Add config option for batched hcalls
Add a config option for the (batched) MULTITCE and BULK_REMOVE h-calls. By default, these options are on and are beneficial for performance and throughput reasons. If disabled, the code will fall back to using less optimal TCE and REMOVE hcalls. The ability to easily disable these options is useful for some of the PREEMPT_RT related investigation and work occurring on Power.

Signed-off-by: Will Schmidt will_schm...@vnet.ibm.com
cc: Anton Blanchard an...@samba.org
cc: Benjamin Herrenschmidt b...@kernel.crashing.org
---

diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig
index f0e6f28..0b5e6a9 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -81,3 +81,23 @@ config DTL
 	  which are accessible through a debugfs file.
 
 	  Say N if you are unsure.
+
+config BULK_REMOVE
+	bool "Enable BULK_REMOVE"
+	depends on PPC_PSERIES
+	default y
+	help
+	  Enable the BULK_REMOVE option for the hash page code.
+	  This relies on a hcall-bulk firmware feature, and
+	  should be enabled for performance throughput.
+
+config MULTITCE
+	bool "Enable MultiTCE"
+	depends on PPC_PSERIES
+	default y
+	help
+	  Enable the Multi-TCE code, allowing a single hcall to
+	  update multiple TCE entries at one time. This relies
+	  on a hcall-multi-tce firmware feature, and should be
+	  enabled for performance throughput.
+

diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 0a4d8c..4327064 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -51,9 +51,13 @@ firmware_features_table[FIRMWARE_MAX_FEATURES] = {
 	{FW_FEATURE_VIO,		"hcall-vio"},
 	{FW_FEATURE_RDMA,		"hcall-rdma"},
 	{FW_FEATURE_LLAN,		"hcall-lLAN"},
+#if defined(CONFIG_BULK_REMOVE)
 	{FW_FEATURE_BULK_REMOVE,	"hcall-bulk"},
+#endif
 	{FW_FEATURE_XDABR,		"hcall-xdabr"},
+#if defined(CONFIG_MULTITCE)
 	{FW_FEATURE_MULTITCE,		"hcall-multi-tce"},
+#endif
 	{FW_FEATURE_SPLPAR,		"hcall-splpar"},
 };
 
 /* Build up the firmware features bitmask using the contents of
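The v1 approach works by compiling entries out of the feature table: if an hcall name is absent from the table, probing the hypervisor's advertised names can never set the corresponding bit, so the feature stays off everywhere. A userspace sketch of that table-driven probe (simplified names and a plain macro instead of Kconfig; the real code walks device-tree properties):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Flip to 0 to compile the BULK_REMOVE entry out, as the
 * #if defined(CONFIG_BULK_REMOVE) guard does in firmware.c. */
#define CONFIG_BULK_REMOVE 1

#define FW_FEATURE_MULTITCE	(1UL << 0)
#define FW_FEATURE_BULK_REMOVE	(1UL << 1)

struct fw_entry {
	unsigned long val;
	const char *name;
};

static const struct fw_entry table[] = {
	{ FW_FEATURE_MULTITCE,    "hcall-multi-tce" },
#if CONFIG_BULK_REMOVE
	{ FW_FEATURE_BULK_REMOVE, "hcall-bulk" },
#endif
};

/* Build the feature mask from the hcall names firmware reports:
 * a name only sets a bit if its table entry was compiled in. */
static unsigned long probe(const char **names, size_t n)
{
	unsigned long mask = 0;

	for (size_t i = 0; i < n; i++)
		for (size_t j = 0; j < sizeof(table) / sizeof(table[0]); j++)
			if (strcmp(names[i], table[j].name) == 0)
				mask |= table[j].val;
	return mask;
}
```

This also shows why Olof's bootarg suggestion is strictly more flexible: the compile-time guard fixes the table for every boot of that kernel image, while the later `__setup()` version clears the same bit at boot time with no rebuild.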
Re: [PATCH 1/2]: Powerpc: Fix EHCA driver on relocatable kernel
On Thu, 19 Aug 2010 23:08:09 -0500 Sonny Rao sonny...@us.ibm.com wrote:

Some modules (like eHCA) want to map all of kernel memory; for this to work with a relocated kernel, we need to export kernstart_addr so modules can use PHYSICAL_START, and memstart_addr so they can use MEMORY_START. Note that the 32-bit code already exports these symbols.

Signed-off-By: Sonny Rao sonny...@us.ibm.com
Acked-by: Alexander Schmidt al...@linux.vnet.ibm.com

Index: common/arch/powerpc/mm/init_64.c
===================================================================
--- common.orig/arch/powerpc/mm/init_64.c	2010-08-16 02:38:33.0 -0500
+++ common/arch/powerpc/mm/init_64.c	2010-08-16 02:39:25.0 -0500
@@ -79,7 +79,9 @@
 #endif /* CONFIG_PPC_STD_MMU_64 */
 
 phys_addr_t memstart_addr = ~0;
+EXPORT_SYMBOL(memstart_addr);
 phys_addr_t kernstart_addr;
+EXPORT_SYMBOL(kernstart_addr);
 
 void free_initmem(void)
 {
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
dvh...@linux.vnet.ibm.com wrote on 09/02/2010 01:04:28 AM:

> Subject: Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
>
> With this in place, we no longer see the preempt_count dropping below zero. However, if I offline/online a CPU about 246 times I hit the opposite problem, a preempt_count() overflow. There appears to be a missing preempt_enable() somewhere in the offline/online paths.

This (preempt_count overflow) also occurred in mainline (with CONFIG_PREEMPT=y) in 2.6.35, but not in 2.6.36-rc3. A bisect seems to indicate it was fixed with a7c2bb8279d20d853e43c34584eaf2b039de8026 "powerpc: Re-enable preemption before cpu_die()." Which may look familiar. :-) It looks like this patch went to mainline (likely via the powerpc tree), but may not have gotten back into the -rt branch.

-Will

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
Ankita Garg ank...@in.ibm.com wrote on 08/19/2010 10:58:24 AM:

> Subject: Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
>
> Hi Darren,
>
> On Thu, Jul 22, 2010 at 11:24:13AM -0700, Darren Hart wrote:
> > With some instrumentation we were able to determine that the preempt_count() appears to change across the extended_cede_processor() call. Specifically across the plpar_hcall_norets(H_CEDE) call. On PREEMPT_RT we call this with preempt_count=1 and return with preempt_count=0x. On mainline with CONFIG_PREEMPT=y, the value is different (0x65) but is still incorrect.
>
> I was trying to reproduce this issue on a 2.6.33.7-rt29 kernel. I could easily reproduce this on the RT kernel and not the non-RT kernel. However, I hit it every single time I did a cpu online operation. I also noticed that the issue persists even when I disable H_CEDE by passing the cede_offline=0 kernel commandline parameter. Could you pl confirm if you observe the same in your setup ?

If you see it every time, double-check that you have http://patchwork.ozlabs.org/patch/60922/ or an equivalent in your tree. (The patch moves the preempt_enable_no_resched() call up above a call to cpu_die(). An earlier variation called a preempt_enable before calling start_secondary_resume()).

Darren and I have been seeing different problems (different dedicated-processor LPARs on the same physical system). He's seeing scrambled preempt_count values; I'm tending to see a system hang/death [with processor backtraces showing .pseries_mach_cpu_die or .pseries_dedicated_idle_sleep as expected, but no processes running]. I'll be curious which you end up seeing. With 2.6.33.7-rt27 as of a few minutes ago, I'm still seeing a system hang/death.

> However, the issue still remains. Will spend few cycles looking into this issue.

> > Also of interest is that this path cpu_idle()->cpu_die()->pseries_mach_cpu_die() to start_secondary() enters with a preempt_count=1 if it wasn't corrupted across the hcall.
The early boot path from _start however appears to call start_secondary() with a preempt_count of 0. The following patch is most certainly not correct, but it does eliminate the situation on mainline 100% of the time (there is still a 25% reproduction rate on PREEMPT_RT). Can someone comment on:

1) How can the preempt_count() get mangled across the H_CEDE hcall?
2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?

Hacked-up-by: Darren Hart dvh...@us.ibm.com

Index: linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
===================================================================
--- linux-2.6.33.6.orig/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ linux-2.6.33.6/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -138,6 +138,7 @@ static void pseries_mach_cpu_die(void)
 		 * Kernel stack will be reset and start_secondary()
 		 * will be called to continue the online operation.
 		 */
+		preempt_count() = 0;
 		start_secondary_resume();
 	}
 }

--
Regards,
Ankita Garg (ank...@in.ibm.com)
Linux Technology Center
IBM India Systems Technology Labs, Bangalore, India
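The padded thread_info that appears later in this thread is a classic memory-corruption localization trick: surround the suspect field with sacrificial padding so a stray store from a neighboring field lands in the pads instead of the field, and so the pads themselves record the evidence. A userspace sketch of the idea (the struct layout is illustrative, modeled on the debug thread_info, not the kernel's real one):

```c
#include <assert.h>
#include <string.h>

/* Padded layout in the spirit of the debug thread_info: a wild
 * store aimed near 'flags' or 'preempt_count' hits a pad buffer,
 * leaving 'preempt_count' intact and the corruption detectable. */
struct padded_info {
	int cpu;
	int pad_before[64];
	int preempt_count;
	int pad_after[64];
	unsigned long flags;
};

/* The pads are zero-initialized; any nonzero word means something
 * scribbled where it shouldn't have. */
static int pads_clean(const struct padded_info *ti)
{
	for (int i = 0; i < 64; i++)
		if (ti->pad_before[i] || ti->pad_after[i])
			return 0;
	return 1;
}
```

The same layout also explains the DABR frustration mentioned below: with `preempt_count` packed next to frequently-touched fields, a hardware watchpoint on its cache-line-sized region fires constantly on legitimate neighbors; the padding separates it so a watchpoint hit actually means something.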
Re: [RT,RFC] Hacks allowing -rt to run on POWER7 / Powerpc.
On Sun, 2010-07-11 at 02:49 -0500, Milton Miller wrote:
> On Fri, 09 Jul 2010 about 08:55:01 -, Will Schmidt wrote:
> > We've been seeing some issues with userspace randomly SIGSEGV'ing while running the -RT kernels on POWER7 based systems. After lots of debugging, head scratching, and experimental changes to the code, the problem has been narrowed down such that we can avoid the problems by disabling the TLB batching.
> >
> > After some input from Ben and further debug, we've found that the restoration of the batch->active value near the end of __switch_to() seems to be the key. (The -RT related changes within arch/powerpc/kernel/process.c __switch_to() do the equivalent of an arch_leave_lazy_mmu_mode() before calling _switch, use a hadbatch flag to indicate if batching was active, and then restore that batch->active value on the way out after the call to _switch. That particular code is in the -RT branch, and not found in mainline.)
> >
> > Deferring to Ben (or others in the know) for whether this is the proper solution or if there is something deeper, but..

I believe this is still on Ben's list of things to look at. Between then and now, I'll see if I can get Thomas to pick this up for the -RT tree to keep RT functional on P7 in the mean-time. A bit more debug info below.

> I looked at the patch and noticed 2 changes:
> 1) the batch is checked and cleared after local_irq_save
> 2) enabling the batch is skipped
>
> I talked to Will and had him try moving the local_irq_save above the check for the active batch. That alone did not seem to be enough. However, he confirmed that we are setting batch to active when it is already active in lazy_mmu_enter, meaning that batching is being turned on recursively.
>
> I suggested debug to check that irqs are off after the restore when re-enabling when our debug session timed out.

Based on some of the debug suggestions from Milton:

A WARN_ON for (!irqs_disabled) after local_irq_restore() did not show any hits
(while otherwise continuing to suffer from the tlb batching troubles).

---
	hard_irq_disable();
	last = _switch(old_thread, new_thread);
	local_irq_restore(flags);
	WARN_ON(!irqs_disabled());		<--
#if defined(CONFIG_PPC64) && defined(CONFIG_PREEMPT_RT)
	if (hadbatch) {
		batch = &__get_cpu_var(ppc64_tlb_batch);
		batch->active = 1;
	}
#endif

Another assortment of WARN_ONs in the arch_{enter,leave}_lazy_mmu_mode functions. As Milton stated above, the check for batch->active on the way into the arch_enter_* function did generate lots of hits; the other WARN_ONs did not.

static inline void arch_enter_lazy_mmu_mode(void)
{
	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);

	// WARN_ON(batch->active);	/* lots of hits if enabled */
	WARN_ON(irqs_disabled());	/* nothing */
	batch->active = 1;

static inline void arch_leave_lazy_mmu_mode(void)
{
	struct ppc64_tlb_batch *batch = &get_cpu_var(ppc64_tlb_batch);

	WARN_ON(!batch->active);	/* nothing */
	WARN_ON(irqs_disabled());	/* nothing */

milton

diff -aurp linux-2.6.33.5-rt23.orig/arch/powerpc/kernel/process.c linux-2.6.33.5-rt23.exp/arch/powerpc/kernel/process.c
--- linux-2.6.33.5-rt23.orig/arch/powerpc/kernel/process.c	2010-06-21 11:41:34.402513904 -0500
+++ linux-2.6.33.5-rt23.exp/arch/powerpc/kernel/process.c	2010-07-09 13:15:13.533269904 -0500
@@ -304,10 +304,6 @@ struct task_struct *__switch_to(struct t
 	struct thread_struct *new_thread, *old_thread;
 	unsigned long flags;
 	struct task_struct *last;
-#if defined(CONFIG_PPC64) && defined (CONFIG_PREEMPT_RT)
-	struct ppc64_tlb_batch *batch;
-	int hadbatch;
-#endif
 
 #ifdef CONFIG_SMP
 	/* avoid complexity of lazy save/restore of fpu
@@ -401,16 +397,6 @@ struct task_struct *__switch_to(struct t
 		new_thread->start_tb = current_tb;
 	}
 
-#ifdef CONFIG_PREEMPT_RT
-	batch = &__get_cpu_var(ppc64_tlb_batch);
-	if (batch->active) {
-		hadbatch = 1;
-		if (batch->index) {
-			__flush_tlb_pending(batch);
-		}
-		batch->active = 0;
-	}
-#endif /* #ifdef CONFIG_PREEMPT_RT */
 #endif
 
 	local_irq_save(flags);
@@ -425,16 +411,13 @@ struct task_struct *__switch_to(struct t
 	 * of sync. Hard disable here.
 	 */
 	hard_irq_disable();
-	last = _switch(old_thread, new_thread);
-
-	local_irq_restore(flags);
 
 #if defined(CONFIG_PPC64) && defined(CONFIG_PREEMPT_RT)
-	if (hadbatch) {
-		batch = &__get_cpu_var(ppc64_tlb_batch);
-		batch->active = 1;
-	}
+	arch_leave_lazy_mmu_mode();
 #endif
 
+	last = _switch(old_thread, new_thread);
+
+	local_irq_restore(flags);
 	return last
Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
dvh...@linux.vnet.ibm.com wrote on 07/22/2010 06:57:18 PM:

> Subject: Re: [PATCH][RFC] preempt_count corruption across H_CEDE call with CONFIG_PREEMPT on pseries
>
> On 07/22/2010 03:25 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2010-07-22 at 11:24 -0700, Darren Hart wrote:
> > > 1) How can the preempt_count() get mangled across the H_CEDE hcall?
> > > 2) Should we call preempt_enable() in cpu_idle() prior to cpu_die() ?
> >
> > The preempt count is on the thread info at the bottom of the stack. Can you check the stack pointers ?
>
> Hi Ben, thanks for looking. I instrumented the area around extended_cede_processor() as follows (please confirm I'm getting the stack pointer correctly).
>
> while (get_preferred_offline_state(cpu) == CPU_STATE_INACTIVE) {
> 	asm("mr %0,1" : "=r" (sp));
> 	printk("before H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
> 	extended_cede_processor(cede_latency_hint);
> 	asm("mr %0,1" : "=r" (sp));
> 	printk("after H_CEDE current->stack: %lx, pcnt: %x\n", sp, preempt_count());
> }
>
> On Mainline (2.6.33.6, CONFIG_PREEMPT=y) I see this:
>
> Jul 22 18:37:08 igoort1 kernel: before H_CEDE current->stack: c0010e9e3ce0, pcnt: 1
> Jul 22 18:37:08 igoort1 kernel: after H_CEDE current->stack: c0010e9e3ce0, pcnt: 1
>
> This surprised me as preempt_count is 1 before and after, so no corruption appears to occur on mainline. This makes the pcnt of 65 I see without the preempt_count()=0 hack very strange. I ran several hundred off/on cycles. The issue of preempt_count being 1 is still addressed by this patch however.
>
> On PREEMPT_RT (2.6.33.5-rt23 - tglx, sorry, rt/2.6.33 next time, promise):
>
> Jul 22 18:51:11 igoort1 kernel: before H_CEDE current->stack: c00089bcfcf0, pcnt: 1
> Jul 22 18:51:11 igoort1 kernel: after H_CEDE current->stack: c00089bcfcf0, pcnt:

I'm not seeing the preempt_count value corrupted with my current set of debug; however, I have added buffers to the thread_info struct, so I wonder if I've moved the preempt_count variable out of the way of the corruption. (Still investigating that point..) Why..
because I had been trying to set a DABR on the preempt_count value to catch the corrupter, and due to hits on the nearby _flags fields, getting false positives..

struct thread_info {
	struct task_struct	*task;		/* main task structure */
	struct exec_domain	*exec_domain;	/* execution domain */
	int			cpu;		/* cpu we're on */
	int			pad_buffer[64];
	int			preempt_count;	/* 0 => preemptable, <0 => BUG */
	int			pad_buffer2[256];
	struct restart_block	restart_block;
	unsigned long		local_flags;	/* private flags for thread */
	/* low level flags - has atomic operations done on it */
	unsigned long		flags ____cacheline_aligned_in_smp;
};

In both cases the stack pointer appears unchanged. Note: there is a BUG triggered in between these statements as the preempt_count causes the printk to trigger: Badness at kernel/sched.c:5572

Thanks,

--
Darren Hart
IBM Linux Technology Center
Real-Time Linux Team
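The instrumentation above reads r1 (the powerpc stack pointer) with inline asm to check whether the stack moved across the hcall. A portable analogue of the same "sample the stack before and after the suspect operation" check can be written with the GCC/Clang builtin `__builtin_frame_address(0)`; this is only a sketch of the technique, not the powerpc-specific debug:

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for 'asm("mr %0,1" : "=r" (sp))': the current
 * frame address is a stable proxy for "where is my stack". */
static uintptr_t current_frame(void)
{
	return (uintptr_t)__builtin_frame_address(0);
}

/* Sample the frame address before and after an operation, as the
 * before/after printk pair does around extended_cede_processor().
 * The operation under test (H_CEDE in the mail) is elided here. */
static int stack_unchanged(void)
{
	uintptr_t before = (uintptr_t)__builtin_frame_address(0);
	/* ... operation under test would go here ... */
	uintptr_t after = (uintptr_t)__builtin_frame_address(0);

	return before == after;
}
```

The value of this check in the thread: if the stack pointer is identical across the call but preempt_count changed, the corruption is not "we came back on a different stack" — it is a stray write into the thread_info at the bottom of the same stack, which is what motivated the pad-buffer experiment.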
[PATCH, RT, RFC] Hacks allowing -rt to run on POWER7 / Powerpc.
[PATCH, RT, RFC] Hacks allowing -rt to run on POWER7 / Powerpc.

We've been seeing some issues with userspace randomly SIGSEGV'ing while running the -RT kernels on POWER7 based systems. After lots of debugging, head scratching, and experimental changes to the code, the problem has been narrowed down such that we can avoid the problems by disabling the TLB batching.

After some input from Ben and further debug, we've found that the restoration of the batch->active value near the end of __switch_to() seems to be the key. (The -RT related changes within arch/powerpc/kernel/process.c __switch_to() do the equivalent of an arch_leave_lazy_mmu_mode() before calling _switch, use a hadbatch flag to indicate if batching was active, and then restore that batch->active value on the way out after the call to _switch. That particular code is in the -RT branch, and not found in mainline.)

Deferring to Ben (or others in the know) for whether this is the proper solution or if there is something deeper, but.. IF the right answer is to simply disable the restoration of batch->active, the rest of the CONFIG_PREEMPT_RT changes in __switch_to() should then be replaceable with a single call to arch_leave_lazy_mmu_mode(). The patch here is what I am currently running with, on both POWER6 and POWER7 systems, successfully.
Signed-off-by: Will Schmidt will_schm...@vnet.ibm.com
CC: Ben Herrenschmidt b...@kernel.crashing.org
CC: Thomas Gleixner t...@linutronix.de
---

diff -aurp linux-2.6.33.5-rt23.orig/arch/powerpc/kernel/process.c linux-2.6.33.5-rt23.exp/arch/powerpc/kernel/process.c
--- linux-2.6.33.5-rt23.orig/arch/powerpc/kernel/process.c	2010-06-21 11:41:34.402513904 -0500
+++ linux-2.6.33.5-rt23.exp/arch/powerpc/kernel/process.c	2010-07-09 13:15:13.533269904 -0500
@@ -304,10 +304,6 @@ struct task_struct *__switch_to(struct t
 	struct thread_struct *new_thread, *old_thread;
 	unsigned long flags;
 	struct task_struct *last;
-#if defined(CONFIG_PPC64) && defined (CONFIG_PREEMPT_RT)
-	struct ppc64_tlb_batch *batch;
-	int hadbatch;
-#endif
 
 #ifdef CONFIG_SMP
 	/* avoid complexity of lazy save/restore of fpu
@@ -401,16 +397,6 @@ struct task_struct *__switch_to(struct t
 		new_thread->start_tb = current_tb;
 	}
 
-#ifdef CONFIG_PREEMPT_RT
-	batch = &__get_cpu_var(ppc64_tlb_batch);
-	if (batch->active) {
-		hadbatch = 1;
-		if (batch->index) {
-			__flush_tlb_pending(batch);
-		}
-		batch->active = 0;
-	}
-#endif /* #ifdef CONFIG_PREEMPT_RT */
 #endif
 
 	local_irq_save(flags);
@@ -425,16 +411,13 @@ struct task_struct *__switch_to(struct t
 	 * of sync. Hard disable here.
 	 */
 	hard_irq_disable();
-	last = _switch(old_thread, new_thread);
-
-	local_irq_restore(flags);
 
 #if defined(CONFIG_PPC64) && defined(CONFIG_PREEMPT_RT)
-	if (hadbatch) {
-		batch = &__get_cpu_var(ppc64_tlb_batch);
-		batch->active = 1;
-	}
+	arch_leave_lazy_mmu_mode();
 #endif
 
+	last = _switch(old_thread, new_thread);
+
+	local_irq_restore(flags);
 	return last;
 }
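The lazy-MMU batching the patch is wrestling with can be summarized as: while "lazy" mode is active, TLB invalidations are queued into a per-CPU batch and only issued when the batch fills or when lazy mode is left, which is why calling `arch_leave_lazy_mmu_mode()` before `_switch()` is sufficient. A userspace toy model of those semantics (names and the batch size are illustrative, loosely following `ppc64_tlb_batch`):

```c
#include <assert.h>

#define BATCH_MAX 4

/* Toy per-CPU TLB batch in the spirit of ppc64_tlb_batch. */
struct tlb_batch {
	int active;		/* inside lazy MMU mode? */
	int index;		/* queued invalidations */
	int flushed_total;	/* stats: entries actually flushed */
	int flush_calls;	/* stats: flush operations issued */
};

static void flush_tlb_pending(struct tlb_batch *b)
{
	if (b->index) {
		b->flushed_total += b->index;
		b->flush_calls++;
		b->index = 0;
	}
}

static void enter_lazy_mmu(struct tlb_batch *b) { b->active = 1; }

/* Leaving lazy mode pushes out anything still queued -- this is
 * the property __switch_to() relies on in the patch above. */
static void leave_lazy_mmu(struct tlb_batch *b)
{
	flush_tlb_pending(b);
	b->active = 0;
}

static void invalidate_page(struct tlb_batch *b)
{
	if (!b->active) {		/* not batching: flush at once */
		b->flushed_total++;
		b->flush_calls++;
		return;
	}
	b->index++;
	if (b->index == BATCH_MAX)	/* batch full: flush now */
		flush_tlb_pending(b);
}
```

The payoff is visible in the flush counts: five invalidations inside lazy mode cost two flush operations instead of five, while the leave path guarantees nothing stays queued across a context switch.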
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Thu, 2010-05-20 at 11:05 +0200, Jan-Bernd Themann wrote: Hi Thomas Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY) On Thu, 20 May 2010, Jan-Bernd Themann wrote: Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) From our perspective there is no need to disable interrupts for the RX side as the chip does not fire further interrupts until we tell the chip to do so for a particular queue. We have multiple receive The traces tell a different story though: ehea_recv_irq_handler() napi_reschedule() eoi() ehea_poll() ... ehea_recv_irq_handler() ??? napi_reschedule() ... napi_complete() Can't tell whether you can see the same behaviour in mainline, but I don't see a reason why not. Is this the same interrupt we are seeing here, or do we see a second other interrupt popping up on the same CPU? As I said, with multiple receive queues (if enabled) you can have multiple interrupts in parallel. Same interrupt number (260). Per the trace data, the first ehea_recv_irq_handler (at 117.904525) was on cpu 0, the second (at 117.904689) was on cpu 1. 
...-2180 [000] 117.904525: .ehea_recv_irq_handler: ENTER 0 c000e8bd08b0
...-2180 [000] 117.904527: .ehea_recv_irq_handler: napi_reschedule COMpleted c000e8bd08b0
...-2180 [000] 117.904528: .ehea_recv_irq_handler: EXIT reschedule(1) 1 c000e8bd08b0
...-2180 [000] 117.904529: .xics_unmask_irq: xics: unmask virq 260 772
...-2180 [000] 117.904547: .xics_unmask_irq: xics: unmask virq pre-xive 260 772 0 status:0 ff
...-2180 [000] 117.904586: .xics_unmask_irq: xics: unmask virq post-xive 260 772 0 D:11416 status:0 5
...-2180 [000] 117.904602: .handle_fasteoi_irq: 260 8004000
...-2180 [000] 117.904603: .xics_mask_irq: xics: mask virq 260 772
...-2180 [000] 117.904634: .xics_mask_real_irq: xics: before: mask_real 772 status:0 5
...-2180 [000] 117.904668: .xics_mask_real_irq: xics: after: mask_real 772 status:0 ff
...-2180 [000] 117.904669: .handle_fasteoi_irq: pre-action: 260 8004100
...-2180 [000] 117.904671: .handle_fasteoi_irq: post-action: 260 8004100
...-2180 [000] 117.904672: .handle_fasteoi_irq: exit. 260 8004000
...-7    [000] 117.904681: .ehea_poll: ENTER 1 c000e8bd08b0 poll_counter:0 force:0
...-7    [000] 117.904683: .ehea_proc_rwqes: ehea_check_cqe 0
...-2180 [001] 117.904689: .ehea_recv_irq_handler: ENTER 1 c000e8bd08b0
...-7    [000] 117.904690: .ehea_proc_rwqes: ehea_check_cqe 0
...-2180 [001] 117.904691: .ehea_recv_irq_handler: napi_reschedule inCOMplete c000e8bd08b0
...-2180 [001] 117.904692: .ehea_recv_irq_handler: EXIT reschedule(0) 1 c000e8bd08b0
...-2180 [001] 117.904694: .xics_unmask_irq: xics: unmask virq 260 772
...-7    [000] 117.904702: .ehea_refill_rq2: ehea_refill_rq2
...-7    [000] 117.904703: .ehea_refill_rq_def: ehea_refill_rq_def
...-7    [000] 117.904704: .ehea_refill_rq3: ehea_refill_rq3
...-7    [000] 117.904705: .ehea_refill_rq_def: ehea_refill_rq_def
...-7    [000] 117.904706: .napi_complete: napi_complete: ENTER state: 1 c000e8bd08b0
...-7    [000] 117.904707: .napi_complete: napi_complete: EXIT state: 0 c000e8bd08b0
...-7    [000] 117.904710: .ehea_poll: EXIT !cqe rx(2). 0 c000e8bd08b0
...-2180 [001] 117.904719: .xics_unmask_irq: xics: unmask virq pre-xive 260 772 0 status:0 ff
...-2180 [001] 117.904761: .xics_unmask_irq: xics: unmask virq post-xive 260 772 0 D:12705 status:0 5

Please check if multiple queues are enabled. The following module parameter is used for that:

MODULE_PARM_DESC(use_mcs, " 0:NAPI, 1:Multiple receive queues, Default = 0 ");

No module parameters were used, should be plain old defaults. you should also see the number of used HEA interrupts in /proc/interrupts

256: 100000 00 XICS Level ehea_neq
259: 000000 00 XICS Level eth0-aff
260: 36196500000 00 XICS Level eth0-queue0

queues with an own interrupt each so that the interrupts can arrive on multiple CPUs in parallel. Interrupts are enabled again when we leave the NAPI Poll function for the corresponding receive queue. I can't see a piece of code which does that, but that's probably just lack of detailed hardware knowledge on my side.
If you mean the re-enable piece of code, it is not very obvious, you are right. Interrupts are only generated if a particular register for our completion queues is written. We do this in the following line:
Re: [PATCH RT] ehea: make receive irq handler non-threaded (IRQF_NODELAY)
On Thu, 2010-05-20 at 16:45 +0200, Thomas Gleixner wrote: On Thu, 20 May 2010, Darren Hart wrote: On 05/20/2010 01:14 AM, Thomas Gleixner wrote: On Thu, 20 May 2010, Jan-Bernd Themann wrote: Thought more about that. The case at hand (ehea) is nasty: The driver does _NOT_ disable the rx interrupt in the card in the rx interrupt handler - for whatever reason. Yeah I saw that, but I don't know why it's written that way. Perhaps Jan-Bernd or Doug will chime in and enlighten us? :) From our perspective there is no need to disable interrupts for the RX side as the chip does not fire further interrupts until we tell the chip to do so for a particular queue. We have multiple receive The traces tell a different story though: ehea_recv_irq_handler() napi_reschedule() eoi() ehea_poll() ... ehea_recv_irq_handler() ??? napi_reschedule() ... napi_complete() Can't tell whether you can see the same behaviour in mainline, but I don't see a reason why not. I was going to suggest that because these are threaded handlers, perhaps they are rescheduled on a different CPU and then receive the interrupt for the other CPU/queue that Jan was mentioning. But, the handlers are affined if I remember correctly, and we aren't running with multiple receive queues. So, we're back to the same question, why are we seeing another irq. It comes in before napi_complete() and therefor before the ehea_reset*() block of calls which do the equivalent of re-enabling interrupts. Can you slap a few trace points into that driver with a stock mainline kernel and verify that ? 2.6.33.4 (non-rt kernel) with similar trace_printk hooks in place... Most data lumps look like so: idle-0 [000] 1097.685337: .handle_fasteoi_irq: ENTER 260 4000 idle-0 [000] 1097.685339: .handle_fasteoi_irq: pre-action 260 4100 idle-0 [000] 1097.685339: .ehea_recv_irq_handler: ENTER c000e8980700 idle-0 [000] 1097.685340: .ehea_recv_irq_handler: napi_schedule ... 
c000e8980700 idle-0 [000] 1097.685341: .ehea_recv_irq_handler: napi_schedule Calling __napi_schedule ... c000e8980700 idle-0 [000] 1097.685342: .ehea_recv_irq_handler: EXIT c000e8980700 idle-0 [000] 1097.685343: .handle_fasteoi_irq: post-action 260 4100 idle-0 [000] 1097.685344: .handle_fasteoi_irq: EXIT. 260 4000 idle-0 [000] 1097.685346: .ehea_poll: ENTER c000e8980700 idle-0 [000] 1097.685352: .napi_complete: napi_complete: ENTER c000e8980700 idle-0 [000] 1097.685352: .napi_complete: napi_complete: EXIT c000e8980700 idle-0 [000] 1097.685355: .ehea_poll: EXIT !cqe rx(1) c000e8980700 But I did see one like this, which shows a ehea_recv_irq_handler ENTER within a ehea_poll ENTER. (which I think is what you were expecting, or wanted to verify..) idle-0 [000] 1097.616261: .handle_fasteoi_irq: ENTER 260 4000 idle-0 [000] 1097.616262: .handle_fasteoi_irq: pre-action 260 4100 * idle-0 [000] 1097.616263: .ehea_recv_irq_handler: ENTER c000e8980700 idle-0 [000] 1097.616264: .ehea_recv_irq_handler: napi_schedule ... c000e8980700 idle-0 [000] 1097.616265: .ehea_recv_irq_handler: napi_schedule Calling __napi_schedule ... c000e8980700 idle-0 [000] 1097.616265: .ehea_recv_irq_handler: EXIT c000e8980700 idle-0 [000] 1097.616266: .handle_fasteoi_irq: post-action 260 4100 idle-0 [000] 1097.616268: .handle_fasteoi_irq: EXIT. 260 4000 * idle-0 [000] 1097.616270: .ehea_poll: ENTER c000e8980700 idle-0 [000] 1097.616282: .handle_fasteoi_irq: ENTER 260 4000 idle-0 [000] 1097.616283: .handle_fasteoi_irq: pre-action 260 4100 * idle-0 [000] 1097.616284: .ehea_recv_irq_handler: ENTER c000e8980700 idle-0 [000] 1097.616285: .ehea_recv_irq_handler: napi_schedule ... c000e8980700 idle-0 [000] 1097.616286: .ehea_recv_irq_handler: napi_schedule NOT Calling __napi_schedule... c000e8980700 idle-0 [000] 1097.616286: .ehea_recv_irq_handler: EXIT c000e8980700 idle-0 [000] 1097.616287: .handle_fasteoi_irq: post-action 260 4100 idle-0 [000] 1097.616289: .handle_fasteoi_irq: EXIT. 
260 4000 idle-0 [000] 1097.616299: .napi_complete: napi_complete: ENTER c000e8980700 idle-0 [000] 1097.616300: .napi_complete: napi_complete: EXIT c000e8980700 idle-0 [000] 1097.616302: .ehea_poll: napi_reschedule COMpleted c000e8980700 idle-0 [000]
Re: [PATCH 25/37] drivers/infiniband: use .dev.of_node instead of .node in struct of_device
On Thu, 11 Mar 2010 11:57:53 -0800 Roland Dreier rdre...@cisco.com wrote: Seems fine... adding EHCA guys just to make sure. .node is being removed Signed-off-by: Grant Likely grant.lik...@secretlab.ca Acked-by: Alexander Schmidt al...@linux.vnet.ibm.com ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: hypervisor call tracepoints hcall_stats touchup.
On Wed, 2009-11-25 at 16:19 -0500, Steven Rostedt wrote: On Wed, 2009-11-25 at 10:12 -0600, Will Schmidt wrote: Tested-by: Will Schmidt will_schm...@vnet.ibm.com Signed-off-by: Will Schmidt will_schm...@vnet.ibm.com Isn't it assumed that the one that made the patch also tested it? Well I would hope that is the norm. I like to make that assumption, but wanted to be clear on the point for anyone with doubts. :-) -- Steve ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
hypervisor call tracepoints hcall_stats touchup.
The tb_total and purr_total values reported via the hcall_stats code should be cumulative, rather than being replaced by the latest delta tb or purr value.

Tested-by: Will Schmidt will_schm...@vnet.ibm.com
Signed-off-by: Will Schmidt will_schm...@vnet.ibm.com
---
[ This is a touch-up to the [3/6] powerpc: tracing: Add hypervisor call tracepoints patch submitted by Anton a few weeks back, so I've copied folks Anton had on CC for his original patch. This fix is rather ppc specific, so it can probably go in via the ppc tree, but I've no real preference. ]

diff --git a/arch/powerpc/platforms/pseries/hvCall_inst.c b/arch/powerpc/platforms/pseries/hvCall_inst.c
index 2f58c71..1fefae7 100644
--- a/arch/powerpc/platforms/pseries/hvCall_inst.c
+++ b/arch/powerpc/platforms/pseries/hvCall_inst.c
@@ -124,8 +124,8 @@ static void probe_hcall_exit(unsigned long opcode, unsigned long retval,
 	h = &__get_cpu_var(hcall_stats)[opcode / 4];
 	h->num_calls++;
-	h->tb_total = mftb() - h->tb_start;
-	h->purr_total = mfspr(SPRN_PURR) - h->purr_start;
+	h->tb_total += mftb() - h->tb_start;
+	h->purr_total += mfspr(SPRN_PURR) - h->purr_start;
 	put_cpu_var(hcall_stats);
 }
[PATCH] ehca: use port autodetect mode as default
This patch sets the port autodetect mode as default for the ehca driver. The autodetect code has been in the kernel for several releases now and has proved to be stable.
---
Roland, please queue this change for 2.6.32 if you are okay with it.

 drivers/infiniband/hw/ehca/ehca_main.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_main.c
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_main.c
@@ -52,7 +52,7 @@
 #include "ehca_tools.h"
 #include "hcp_if.h"
-#define HCAD_VERSION "0028"
+#define HCAD_VERSION "0029"
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_AUTHOR("Christoph Raisch rai...@de.ibm.com");
@@ -64,7 +64,7 @@
 static int ehca_hw_level = 0;
 static int ehca_poll_all_eqs = 1;
 int ehca_debug_level = 0;
-int ehca_nr_ports = 2;
+int ehca_nr_ports = -1;
 int ehca_use_hp_mr = 0;
 int ehca_port_act_time = 30;
 int ehca_static_rate = -1;
@@ -95,8 +95,8 @@
 MODULE_PARM_DESC(hw_level,
 		 "Hardware level (0: autosensing (default), 0x10..0x14: eHCA, 0x20..0x23: eHCA2)");
 MODULE_PARM_DESC(nr_ports,
-		 "number of connected ports (-1: autodetect, 1: port one only, 2: two ports (default)");
+		 "number of connected ports (-1: autodetect (default), 1: port one only, 2: two ports)");
 MODULE_PARM_DESC(use_hp_mr,
 		 "Use high performance MRs (default: no)");
 MODULE_PARM_DESC(port_act_time,
Re: [ewg] Re: [PATCH 2.6.31 try 2] ehca: Tolerate dynamic memory operations and huge pages
On Mon, 22 Jun 2009 22:19:21 -0700 Roland Dreier rdre...@cisco.com wrote:

> thanks, applied.
> -#define HCAD_VERSION "0026"
> +#define HCAD_VERSION "0027"
> the driver version was already 0027 (since bde2cfaf), so I dropped this chunk.

thank you for applying, we would like to increase the version number for this patch, so please also apply the following:

ehca: Increment version number for DMEM toleration

Signed-off-by: Alexander Schmidt al...@linux.vnet.ibm.com
---
 drivers/infiniband/hw/ehca/ehca_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_main.c
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_main.c
@@ -52,7 +52,7 @@
 #include "ehca_tools.h"
 #include "hcp_if.h"
-#define HCAD_VERSION "0027"
+#define HCAD_VERSION "0028"
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_AUTHOR("Christoph Raisch rai...@de.ibm.com");
Re: [ewg] Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
On Tue, 16 Jun 2009 09:10:39 -0700 Roland Dreier rdre...@cisco.com wrote: Yeah, the notifier code remains untouched as we still do not allow dynamic memory operations _while_ our module is loaded. The patch allows the driver to cope with DMEM operations that happened before the module was loaded, which might result in a non-contiguous memory layout. When the driver registers its global memory region in the system, the memory layout must be considered. We chose the term toleration instead of support to illustrate this. I see. So things just silently broke in some cases when the driver was loaded after operations you didn't tolerate? Anyway, thanks for the explanation. Well, things did not break silently. The registration of the MR failed with an error code which was reported to userspace. Will you push the patch for .31 or .32? Thanks, Alex ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2.6.31] ehca: Tolerate dynamic memory operations and huge pages
Hi Roland, thank you for taking a look at the code! On Fri, 12 Jun 2009 21:50:58 -0700 Roland Dreier rdre...@cisco.com wrote: OK, one major issue with this patch and a few minor nits. First, the major issue is that I don't see anything in the patch that changes the code in ehca_mem_notifier() in ehca_main.c: case MEM_GOING_ONLINE: case MEM_GOING_OFFLINE: /* only ok if no hca is attached to the lpar */ spin_lock_irqsave(shca_list_lock, flags); if (list_empty(shca_list)) { spin_unlock_irqrestore(shca_list_lock, flags); return NOTIFY_OK; } else { spin_unlock_irqrestore(shca_list_lock, flags); if (printk_timed_ratelimit(ehca_dmem_warn_time, 30 * 1000)) ehca_gen_err(DMEM operations are not allowed in conjunction with eHCA); return NOTIFY_BAD; } But your patch description says: This patch implements toleration of dynamic memory operations But it seems you're still going to hit the same NOTIFY_BAD case above after your patch. So something doesn't compute for me. Could you explain more? Yeah, the notifier code remains untouched as we still do not allow dynamic memory operations _while_ our module is loaded. The patch allows the driver to cope with DMEM operations that happened before the module was loaded, which might result in a non-contiguous memory layout. When the driver registers its global memory region in the system, the memory layout must be considered. We chose the term toleration instead of support to illustrate this. I'll put some more details into the changelog, incorporate the other comments and send out a second version of the patch. 
Thanks, Alex Second, a nit: +#define EHCA_REG_MR 0 +#define EHCA_REG_BUSMAP_MR (~0) and you pass these as the reg_busmap parm in: int ehca_reg_mr(struct ehca_shca *shca, struct ehca_mr *e_mr, u64 *iova_start, @@ -991,7 +1031,8 @@ struct ehca_pd *e_pd, struct ehca_mr_pginfo *pginfo, u32 *lkey, /*OUT*/ - u32 *rkey) /*OUT*/ + u32 *rkey, /*OUT*/ + int reg_busmap) and test it as: + if (reg_busmap) + ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo); + else + ret = ehca_reg_mr_rpages(shca, e_mr, pginfo); So the ~0 for true looks a bit odd. One option would be to make reg_busmap a bool, since that's how you're using it, but then you lose the nice self-documenting macro where you call things. So I think it would be cleaner to do something like enum ehca_reg_type { EHCA_REG_MR, EHCA_REG_BUSMAP_MR }; and make the int reg_busmap parameter into enum ehca_reg_type reg_type and have the code become + if (reg_type == EHCA_REG_BUSMAP_MR) + ret = ehca_reg_bmap_mr_rpages(shca, e_mr, pginfo); + else if (reg_type == EHCA_REG_MR) + ret = ehca_reg_mr_rpages(shca, e_mr, pginfo); + else + ret = -EINVAL or something like that. +struct ib_dma_mapping_ops ehca_dma_mapping_ops = { + .mapping_error = ehca_dma_mapping_error, + .map_single = ehca_dma_map_single, + .unmap_single = ehca_dma_unmap_single, + .map_page = ehca_dma_map_page, + .unmap_page = ehca_dma_unmap_page, + .map_sg = ehca_dma_map_sg, + .unmap_sg = ehca_dma_unmap_sg, + .dma_address = ehca_dma_address, + .dma_len = ehca_dma_len, + .sync_single_for_cpu = ehca_dma_sync_single_for_cpu, + .sync_single_for_device = ehca_dma_sync_single_for_device, + .alloc_coherent = ehca_dma_alloc_coherent, + .free_coherent = ehca_dma_free_coherent, +}; I always think structures like this are easier to read if you align the '=' signs. But no big deal. 
+ ret = ehca_create_busmap(); + if (ret) { + ehca_gen_err(Cannot create busmap.); + goto module_init2; + } + ret = ibmebus_register_driver(ehca_driver); if (ret) { ehca_gen_err(Cannot register eHCA device driver); ret = -EINVAL; - goto module_init2; + goto module_init3; } ret = register_memory_notifier(ehca_mem_nb); if (ret) { ehca_gen_err(Failed registering memory add/remove notifier); - goto module_init3; + goto module_init4; Having to renumber unrelated things is when something changes is why I don't like this style of error path labels. But I think it's well and truly too late to fix that in ehca. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH 2.6.31 try 2] ehca: Tolerate dynamic memory operations and huge pages
From: Hannes Hering heri...@de.ibm.com This patch implements toleration of dynamic memory operations and 16 GB gigantic pages. Toleration means that the driver can cope with dynamic memory operations that happened before the driver was loaded. While using the ehca driver, dynamic memory operations are still prohibited. On module load the driver walks through available system memory, checks for available memory ranges and then registers the kernel internal memory region accordingly. The translation of address ranges is implemented via a 3-level busmap. Signed-off-by: Hannes Hering heri...@de.ibm.com --- This patch is built and tested against infiniband.git. Please apply for 2.6.31. drivers/infiniband/hw/ehca/ehca_main.c | 20 + drivers/infiniband/hw/ehca/ehca_mrmw.c | 508 - drivers/infiniband/hw/ehca/ehca_mrmw.h | 13 3 files changed, 523 insertions(+), 18 deletions(-) --- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_main.c +++ infiniband.git/drivers/infiniband/hw/ehca/ehca_main.c @@ -52,7 +52,7 @@ #include ehca_tools.h #include hcp_if.h -#define HCAD_VERSION 0026 +#define HCAD_VERSION 0027 MODULE_LICENSE(Dual BSD/GPL); MODULE_AUTHOR(Christoph Raisch rai...@de.ibm.com); @@ -506,6 +506,7 @@ static int ehca_init_device(struct ehca_ shca-ib_device.detach_mcast= ehca_detach_mcast; shca-ib_device.process_mad = ehca_process_mad; shca-ib_device.mmap= ehca_mmap; + shca-ib_device.dma_ops = ehca_dma_mapping_ops; if (EHCA_BMASK_GET(HCA_CAP_SRQ, shca-hca_cap)) { shca-ib_device.uverbs_cmd_mask |= @@ -1028,17 +1029,23 @@ static int __init ehca_module_init(void) goto module_init1; } + ret = ehca_create_busmap(); + if (ret) { + ehca_gen_err(Cannot create busmap.); + goto module_init2; + } + ret = ibmebus_register_driver(ehca_driver); if (ret) { ehca_gen_err(Cannot register eHCA device driver); ret = -EINVAL; - goto module_init2; + goto module_init3; } ret = register_memory_notifier(ehca_mem_nb); if (ret) { ehca_gen_err(Failed registering memory add/remove notifier); - goto 
module_init3; + goto module_init4; } if (ehca_poll_all_eqs != 1) { @@ -1053,9 +1060,12 @@ static int __init ehca_module_init(void) return 0; -module_init3: +module_init4: ibmebus_unregister_driver(ehca_driver); +module_init3: + ehca_destroy_busmap(); + module_init2: ehca_destroy_slab_caches(); @@ -1073,6 +1083,8 @@ static void __exit ehca_module_exit(void unregister_memory_notifier(ehca_mem_nb); + ehca_destroy_busmap(); + ehca_destroy_slab_caches(); ehca_destroy_comp_pool(); --- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_mrmw.c +++ infiniband.git/drivers/infiniband/hw/ehca/ehca_mrmw.c @@ -53,6 +53,38 @@ /* max number of rpages (per hcall register_rpages) */ #define MAX_RPAGES 512 +/* DMEM toleration management */ +#define EHCA_SECTSHIFTSECTION_SIZE_BITS +#define EHCA_SECTSIZE (1UL EHCA_SECTSHIFT) +#define EHCA_HUGEPAGESHIFT 34 +#define EHCA_HUGEPAGE_SIZE (1UL EHCA_HUGEPAGESHIFT) +#define EHCA_HUGEPAGE_PFN_MASK ((EHCA_HUGEPAGE_SIZE - 1) PAGE_SHIFT) +#define EHCA_INVAL_ADDR0xULL +#define EHCA_DIR_INDEX_SHIFT 13 /* 8k Entries in 64k block */ +#define EHCA_TOP_INDEX_SHIFT (EHCA_DIR_INDEX_SHIFT * 2) +#define EHCA_MAP_ENTRIES (1 EHCA_DIR_INDEX_SHIFT) +#define EHCA_TOP_MAP_SIZE (0x1) /* currently fixed map size */ +#define EHCA_DIR_MAP_SIZE (0x1) +#define EHCA_ENT_MAP_SIZE (0x1) +#define EHCA_INDEX_MASK (EHCA_MAP_ENTRIES - 1) + +static unsigned long ehca_mr_len; + +/* + * Memory map data structures + */ +struct ehca_dir_bmap { + u64 ent[EHCA_MAP_ENTRIES]; +}; +struct ehca_top_bmap { + struct ehca_dir_bmap *dir[EHCA_MAP_ENTRIES]; +}; +struct ehca_bmap { + struct ehca_top_bmap *top[EHCA_MAP_ENTRIES]; +}; + +static struct ehca_bmap *ehca_bmap; + static struct kmem_cache *mr_cache; static struct kmem_cache *mw_cache; @@ -68,6 +100,8 @@ enum ehca_mr_pgsize { #define EHCA_MR_PGSHIFT1M 20 #define EHCA_MR_PGSHIFT16M 24 +static u64 ehca_map_vaddr(void *caddr); + static u32 ehca_encode_hwpage_size(u32 pgsize) { int log = ilog2(pgsize); @@ -135,7 +169,8 @@ struct ib_mr 
*ehca_get_dma_mr(struct ib_ goto get_dma_mr_exit0; } - ret = ehca_reg_maxmr(shca, e_maxmr, (u64 *)KERNELBASE, + ret = ehca_reg_maxmr(shca, e_maxmr, +(void *)ehca_map_vaddr((void *)KERNELBASE), mr_access_flags, e_pd,
Re: [ewg] Re: [PATCH 1/3] IB/ehca: Replace vmalloc with kmalloc
Hi Roland, did you have a chance to take a look at the patchset and will you apply it, or are there any outstanding issues we need to address? Regards, Alex

On Wed, 22 Apr 2009 16:02:28 +0200 Stefan Roscher ossro...@linux.vnet.ibm.com wrote:

In case of large queue pairs there is the possibility of allocation failures due to memory fragmentation with kmalloc(). To ensure the memory is allocated even if kmalloc() can not find chunks which are big enough, we try to allocate the memory with vmalloc().

Signed-off-by: Stefan Roscher stefan.rosc...@de.ibm.com
---
On Tuesday 21 April 2009 07:34:30 pm Roland Dreier wrote:
> + queue->queue_pages = kmalloc(nr_of_pages * sizeof(void *), GFP_KERNEL);
> How big might this buffer be? Any chance of allocation failure due to memory fragmentation? - R.

Hey Roland, yes you are right and here is the patch to circumvent the described problem. It will apply on top of the patchset. regards Stefan

 drivers/infiniband/hw/ehca/ipz_pt_fn.c | 17 +++++++++++++----
 1 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/hw/ehca/ipz_pt_fn.c b/drivers/infiniband/hw/ehca/ipz_pt_fn.c
index a260559..1227c59 100644
--- a/drivers/infiniband/hw/ehca/ipz_pt_fn.c
+++ b/drivers/infiniband/hw/ehca/ipz_pt_fn.c
@@ -222,8 +222,11 @@ int ipz_queue_ctor(struct ehca_pd *pd, struct ipz_queue *queue,
 	/* allocate queue page pointers */
 	queue->queue_pages = kmalloc(nr_of_pages * sizeof(void *), GFP_KERNEL);
 	if (!queue->queue_pages) {
-		ehca_gen_err("Couldn't allocate queue page list");
-		return 0;
+		queue->queue_pages = vmalloc(nr_of_pages * sizeof(void *));
+		if (!queue->queue_pages) {
+			ehca_gen_err("Couldn't allocate queue page list");
+			return 0;
+		}
 	}
 	memset(queue->queue_pages, 0, nr_of_pages * sizeof(void *));
@@ -240,7 +243,10 @@ int ipz_queue_ctor(struct ehca_pd *pd, struct ipz_queue *queue,
 ipz_queue_ctor_exit0:
 	ehca_gen_err("Couldn't alloc pages queue=%p nr_of_pages=%x", queue, nr_of_pages);
-	kfree(queue->queue_pages);
+	if (is_vmalloc_addr(queue->queue_pages))
+		vfree(queue->queue_pages);
+	else
+		kfree(queue->queue_pages);
 	return 0;
 }
@@ -262,7 +268,10 @@ int ipz_queue_dtor(struct ehca_pd *pd, struct ipz_queue *queue)
 		free_page((unsigned long)queue->queue_pages[i]);
 	}
-	kfree(queue->queue_pages);
+	if (is_vmalloc_addr(queue->queue_pages))
+		vfree(queue->queue_pages);
+	else
+		kfree(queue->queue_pages);
 	return 1;
 }
Re: [ewg] Re: [PATCH 1/3] IB/ehca: Replace vmalloc with kmalloc
On Tue, 28 Apr 2009 07:01:32 -0700 Roland Dreier rdre...@cisco.com wrote: did you have a chance to take a look at the patchset and will you apply it, or are there any outstanding issues we need to address? I guess it's OK, but definitely 2.6.31 material. I guess I'll stick it linux-next soon. - R. Okay with us, thank you very much! Alex ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH] ib/ehca: add flush CQE generation
When a QP goes into error state, it is required that flush CQEs are delivered to the application for any outstanding work requests. eHCA does not do this in hardware, so this patch adds software flush CQE generation to the ehca driver. Whenever a QP gets into error state, it is added to the QP error list of its respective CQ. If the error QP list of a CQ is not empty, poll_cq() generates flush CQEs before polling the actual CQ. Signed-off-by: Alexander Schmidt [EMAIL PROTECTED] --- Applies on top of 2.6.27-rc3, please consider this for 2.6.28. drivers/infiniband/hw/ehca/ehca_classes.h | 14 + drivers/infiniband/hw/ehca/ehca_cq.c |3 drivers/infiniband/hw/ehca/ehca_iverbs.h |2 drivers/infiniband/hw/ehca/ehca_qp.c | 225 -- drivers/infiniband/hw/ehca/ehca_reqs.c| 211 5 files changed, 412 insertions(+), 43 deletions(-) --- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_classes.h +++ infiniband.git/drivers/infiniband/hw/ehca/ehca_classes.h @@ -164,6 +164,13 @@ struct ehca_qmap_entry { u16 reported; }; +struct ehca_queue_map { + struct ehca_qmap_entry *map; + unsigned int entries; + unsigned int tail; + unsigned int left_to_poll; +}; + struct ehca_qp { union { struct ib_qp ib_qp; @@ -173,8 +180,9 @@ struct ehca_qp { enum ehca_ext_qp_type ext_type; enum ib_qp_state state; struct ipz_queue ipz_squeue; - struct ehca_qmap_entry *sq_map; + struct ehca_queue_map sq_map; struct ipz_queue ipz_rqueue; + struct ehca_queue_map rq_map; struct h_galpas galpas; u32 qkey; u32 real_qp_num; @@ -204,6 +212,8 @@ struct ehca_qp { atomic_t nr_events; /* events seen */ wait_queue_head_t wait_completion; int mig_armed; + struct list_head sq_err_node; + struct list_head rq_err_node; }; #define IS_SRQ(qp) (qp-ext_type == EQPT_SRQ) @@ -233,6 +243,8 @@ struct ehca_cq { /* mmap counter for resources mapped into user space */ u32 mm_count_queue; u32 mm_count_galpa; + struct list_head sqp_err_list; + struct list_head rqp_err_list; }; enum ehca_mr_flag { --- 
infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_reqs.c +++ infiniband.git/drivers/infiniband/hw/ehca/ehca_reqs.c @@ -53,9 +53,25 @@ /* in RC traffic, insert an empty RDMA READ every this many packets */ #define ACK_CIRC_THRESHOLD 200 +static u64 replace_wr_id(u64 wr_id, u16 idx) +{ + u64 ret; + + ret = wr_id ~QMAP_IDX_MASK; + ret |= idx QMAP_IDX_MASK; + + return ret; +} + +static u16 get_app_wr_id(u64 wr_id) +{ + return wr_id QMAP_IDX_MASK; +} + static inline int ehca_write_rwqe(struct ipz_queue *ipz_rqueue, struct ehca_wqe *wqe_p, - struct ib_recv_wr *recv_wr) + struct ib_recv_wr *recv_wr, + u32 rq_map_idx) { u8 cnt_ds; if (unlikely((recv_wr-num_sge 0) || @@ -69,7 +85,7 @@ static inline int ehca_write_rwqe(struct /* clear wqe header until sglist */ memset(wqe_p, 0, offsetof(struct ehca_wqe, u.ud_av.sg_list)); - wqe_p-work_request_id = recv_wr-wr_id; + wqe_p-work_request_id = replace_wr_id(recv_wr-wr_id, rq_map_idx); wqe_p-nr_of_data_seg = recv_wr-num_sge; for (cnt_ds = 0; cnt_ds recv_wr-num_sge; cnt_ds++) { @@ -146,6 +162,7 @@ static inline int ehca_write_swqe(struct u64 dma_length; struct ehca_av *my_av; u32 remote_qkey = send_wr-wr.ud.remote_qkey; + struct ehca_qmap_entry *qmap_entry = qp-sq_map.map[sq_map_idx]; if (unlikely((send_wr-num_sge 0) || (send_wr-num_sge qp-ipz_squeue.act_nr_of_sg))) { @@ -158,11 +175,10 @@ static inline int ehca_write_swqe(struct /* clear wqe header until sglist */ memset(wqe_p, 0, offsetof(struct ehca_wqe, u.ud_av.sg_list)); - wqe_p-work_request_id = send_wr-wr_id ~QMAP_IDX_MASK; - wqe_p-work_request_id |= sq_map_idx QMAP_IDX_MASK; + wqe_p-work_request_id = replace_wr_id(send_wr-wr_id, sq_map_idx); - qp-sq_map[sq_map_idx].app_wr_id = send_wr-wr_id QMAP_IDX_MASK; - qp-sq_map[sq_map_idx].reported = 0; + qmap_entry-app_wr_id = get_app_wr_id(send_wr-wr_id); + qmap_entry-reported = 0; switch (send_wr-opcode) { case IB_WR_SEND: @@ -496,7 +512,9 @@ static int internal_post_recv(struct ehc struct ehca_wqe *wqe_p; int wqe_cnt = 0; int ret 
= 0; + u32 rq_map_idx; unsigned long flags; + struct ehca_qmap_entry *qmap_entry; if (unlikely(!HAS_RQ(my_qp))) { ehca_err(dev, QP has no RQ ehca_qp=%p qp_num=%x ext_type=%d, @@ -524,8 +542,15 @@ static int internal_post_recv(struct ehc
cache model of ppc
Dear all,

I have some questions about the PowerPC cache model. How many caches are there? How many caches are in multi-core systems? Do you have an L1 cache per cpu and one shared L2? Or is there a victim L3 cache? Can you tell me something about the path through the cache? For example, I look up data in L1, but this was a miss. Then I look up in L2, and this was also a miss, but the data is in another L1 cache of another cpu. Is there a 1.5 or 2.5 hit? Can you help me please? This is for my bachelor thesis.

Thanks a lot,
Felix
___
Linuxppc-dev mailing list
Linuxppc-dev@ozlabs.org
https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 0/5] ib/ehca: Fix stability issues
Hi Roland, the following patchset contains four small fixes and one bigger patch (5/5) for addressing some ehca issues we found during cluster test. [1/5] update qp_state on cached modify_qp() [2/5] rename goto label in ehca_poll_cq_one() [3/5] repoll on invalid opcode instead of returning success [4/5] check idr_find() return value [5/5] discard double CQE for one WR They all apply on top of 2.6.27-rc1. If possible, we would like to get them into 2.6.27. Regards, Alexander Schmidt ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH 2/5] ib/ehca: rename goto label
Rename the poll_cq_one_read_cqe goto label to what it actually does, repoll.

Signed-off-by: Alexander Schmidt [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ehca/ehca_reqs.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -589,7 +589,7 @@ static inline int ehca_poll_cq_one(struc
 	struct ehca_qp *my_qp;
 	int cqe_count = 0, is_error;
-poll_cq_one_read_cqe:
+repoll:
 	cqe = (struct ehca_cqe *)
 		ipz_qeit_get_inc_valid(&my_cq->ipz_queue);
 	if (!cqe) {
@@ -617,7 +617,7 @@ poll_cq_one_read_cqe:
 		ehca_dmp(cqe, 64, "cq_num=%x qp_num=%x", my_cq->cq_number, cqe->local_qp_number);
 		/* ignore this purged cqe */
-		goto poll_cq_one_read_cqe;
+		goto repoll;
 	}
 	spin_lock_irqsave(&qp->spinlock_s, flags);
 	purgeflag = qp->sqerr_purgeflag;
@@ -636,7 +636,7 @@ poll_cq_one_read_cqe:
 	 * that caused sqe and turn off purge flag
 	 */
 	qp->sqerr_purgeflag = 0;
-	goto poll_cq_one_read_cqe;
+	goto repoll;
 	}
 }
[PATCH 5/5] ib/ehca: discard double CQE for one WR
Under rare circumstances, the ehca hardware might erroneously generate two CQEs for the same WQE, which is not compliant with the IB spec and will cause unpredictable errors such as memory being freed twice. To avoid this problem, the driver needs to detect the second CQE and discard it.

For this purpose, introduce an array holding as many elements as the SQ of the QP, called sq_map. Each sq_map entry stores a "reported" flag for one WQE in the SQ. When a work request is posted to the SQ, the respective reported flag is set to zero. After the arrival of a CQE, the flag is set to 1, which allows the occurrence of a second CQE to be detected. The mapping between WQE / CQE and the corresponding sq_map element is implemented by replacing the lowest 16 bits of the wr_id with the index in the queue map. The original 16 bits are stored in the sq_map entry and are restored when the CQE is passed to the application.

Signed-off-by: Alexander Schmidt [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ehca/ehca_classes.h |  9 +
 drivers/infiniband/hw/ehca/ehca_qes.h     |  1
 drivers/infiniband/hw/ehca/ehca_qp.c      | 34 +-
 drivers/infiniband/hw/ehca/ehca_reqs.c    | 54 +++---
 4 files changed, 78 insertions(+), 20 deletions(-)

--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -139,6 +139,7 @@ static void trace_send_wr_ud(const struc
 static inline int ehca_write_swqe(struct ehca_qp *qp,
 				  struct ehca_wqe *wqe_p,
 				  const struct ib_send_wr *send_wr,
+				  u32 sq_map_idx,
 				  int hidden)
 {
 	u32 idx;
@@ -157,7 +158,11 @@ static inline int ehca_write_swqe(struct
 	/* clear wqe header until sglist */
 	memset(wqe_p, 0, offsetof(struct ehca_wqe, u.ud_av.sg_list));
 
-	wqe_p->work_request_id = send_wr->wr_id;
+	wqe_p->work_request_id = send_wr->wr_id & ~QMAP_IDX_MASK;
+	wqe_p->work_request_id |= sq_map_idx & QMAP_IDX_MASK;
+
+	qp->sq_map[sq_map_idx].app_wr_id = send_wr->wr_id & QMAP_IDX_MASK;
+	qp->sq_map[sq_map_idx].reported = 0;
 
 	switch (send_wr->opcode) {
 	case IB_WR_SEND:
@@ -381,6 +386,7 @@ static inline int post_one_send(struct e
 {
 	struct ehca_wqe *wqe_p;
 	int ret;
+	u32 sq_map_idx;
 	u64 start_offset = my_qp->ipz_squeue.current_q_offset;
 
 	/* get pointer next to free WQE */
@@ -393,8 +399,15 @@ static inline int post_one_send(struct e
 			 "qp_num=%x", my_qp->ib_qp.qp_num);
 		return -ENOMEM;
 	}
+
+	/*
+	 * Get the index of the WQE in the send queue. The same index is used
+	 * for writing into the sq_map.
+	 */
+	sq_map_idx = start_offset / my_qp->ipz_squeue.qe_size;
+
 	/* write a SEND WQE into the QUEUE */
-	ret = ehca_write_swqe(my_qp, wqe_p, cur_send_wr, hidden);
+	ret = ehca_write_swqe(my_qp, wqe_p, cur_send_wr, sq_map_idx, hidden);
 	/*
 	 * if something failed,
 	 * reset the free entry pointer to the start value
@@ -654,8 +667,34 @@ repoll:
 			 my_cq, my_cq->cq_number);
 	}
 
-	/* we got a completion! */
-	wc->wr_id = cqe->work_request_id;
+	read_lock(&ehca_qp_idr_lock);
+	my_qp = idr_find(&ehca_qp_idr, cqe->qp_token);
+	read_unlock(&ehca_qp_idr_lock);
+	if (!my_qp)
+		goto repoll;
+	wc->qp = &my_qp->ib_qp;
+
+	if (!(cqe->w_completion_flags & WC_SEND_RECEIVE_BIT)) {
+		struct ehca_qmap_entry *qmap_entry;
+		/*
+		 * We got a send completion and need to restore the original
+		 * wr_id.
+		 */
+		qmap_entry = &my_qp->sq_map[cqe->work_request_id &
+					    QMAP_IDX_MASK];
+
+		if (qmap_entry->reported) {
+			ehca_warn(cq->device, "Double cqe on qp_num=%#x",
+				  my_qp->real_qp_num);
+			/* found a double cqe, discard it and read next one */
+			goto repoll;
+		}
+		wc->wr_id = cqe->work_request_id & ~QMAP_IDX_MASK;
+		wc->wr_id |= qmap_entry->app_wr_id;
+		qmap_entry->reported = 1;
+	} else
+		/* We got a receive completion. */
+		wc->wr_id = cqe->work_request_id;
 
 	/* eval ib_wc_opcode */
 	wc->opcode = ib_wc_opcode[cqe->optype] - 1;
@@ -678,13 +717,6 @@ repoll:
 	} else
 		wc->status = IB_WC_SUCCESS;
 
-	read_lock(&ehca_qp_idr_lock);
-	my_qp = idr_find(&ehca_qp_idr, cqe->qp_token);
-	read_unlock(&ehca_qp_idr_lock);
-	if (!my_qp)
-		goto repoll;
-	wc->qp = &my_qp->ib_qp;
-
 	wc->byte_len = cqe->nr_bytes_transferred;
 	wc->pkey_index = cqe->pkey_index;
 	wc->slid = cqe->rlid
[PATCH 3/5] ib/ehca: repoll on invalid opcode
When the ehca driver detects an invalid opcode in a CQE, it currently passes the CQE to the application and returns with success. This patch changes the CQE handling to discard CQEs with invalid opcodes and to continue reading the next CQE from the CQ.

Signed-off-by: Alexander Schmidt [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ehca/ehca_reqs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -667,7 +667,7 @@ repoll:
 		ehca_dmp(cqe, 64, "ehca_cq=%p cq_num=%x",
 			 my_cq, my_cq->cq_number);
 		/* update also queue adder to throw away this entry!!! */
-		goto poll_cq_one_exit0;
+		goto repoll;
 	}
 
 	/* eval ib_wc_status */
[PATCH 4/5] ib/ehca: check idr_find() return value
The idr_find() function may fail when trying to get the QP that is associated with a CQE, e.g. when a QP has been destroyed between the generation of a CQE and the poll request for it. In consequence, the return value of idr_find() must be checked and the CQE must be discarded when the QP cannot be found.

Signed-off-by: Alexander Schmidt [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ehca/ehca_reqs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_reqs.c
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_reqs.c
@@ -680,8 +680,10 @@ repoll:
 	read_lock(&ehca_qp_idr_lock);
 	my_qp = idr_find(&ehca_qp_idr, cqe->qp_token);
-	wc->qp = &my_qp->ib_qp;
 	read_unlock(&ehca_qp_idr_lock);
+	if (!my_qp)
+		goto repoll;
+	wc->qp = &my_qp->ib_qp;
 
 	wc->byte_len = cqe->nr_bytes_transferred;
 	wc->pkey_index = cqe->pkey_index;
[PATCH 1/5 try2] ib/ehca: update qp_state on cached modify_qp()
Since the introduction of the port auto-detect mode for ehca, calls to modify_qp() may be cached in the device driver when the ports are not activated yet. When a modify_qp() call is cached, the qp state remains untouched until the port is activated, which will leave the qp in the reset state. In the reset state, however, it is not allowed to post SQ WQEs, which confuses applications like ib_mad.

The solution for this problem is to immediately set the qp state as requested by modify_qp(), even when the call is cached.

Signed-off-by: Alexander Schmidt [EMAIL PROTECTED]
---
 drivers/infiniband/hw/ehca/ehca_qp.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

--- infiniband.git.orig/drivers/infiniband/hw/ehca/ehca_qp.c
+++ infiniband.git/drivers/infiniband/hw/ehca/ehca_qp.c
@@ -1534,8 +1534,6 @@ static int internal_modify_qp(struct ib_
 	if (attr_mask & IB_QP_QKEY)
 		my_qp->qkey = attr->qkey;
 
-	my_qp->state = qp_new_state;
-
 modify_qp_exit2:
 	if (squeue_locked) { /* this means: sqe -> rts */
 		spin_unlock_irqrestore(&my_qp->spinlock_s, flags);
@@ -1551,6 +1549,8 @@ modify_qp_exit1:
 int ehca_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr, int attr_mask,
 		   struct ib_udata *udata)
 {
+	int ret = 0;
+
 	struct ehca_shca *shca = container_of(ibqp->device, struct ehca_shca,
 					      ib_device);
 	struct ehca_qp *my_qp = container_of(ibqp, struct ehca_qp, ib_qp);
@@ -1597,12 +1597,18 @@ int ehca_modify_qp(struct ib_qp *ibqp, s
 				 attr->qp_state, my_qp->init_attr.port_num,
 				 ibqp->qp_type);
 			spin_unlock_irqrestore(&sport->mod_sqp_lock, flags);
-			return 0;
+			goto out;
 		}
 		spin_unlock_irqrestore(&sport->mod_sqp_lock, flags);
 	}
 
-	return internal_modify_qp(ibqp, attr, attr_mask, 0);
+	ret = internal_modify_qp(ibqp, attr, attr_mask, 0);
+
+out:
+	if ((ret == 0) && (attr_mask & IB_QP_STATE))
+		my_qp->state = attr->qp_state;
+
+	return ret;
 }
 
 void ehca_recover_sqp(struct ib_qp *sqp)
[PATCH 0/5 try2] ib/ehca: Fix stability issues
Hi Roland, Sorry, the first set was mangled because of a broken mailer, so here it is again, double checked... the following patchset contains four small fixes and one bigger patch (5/5) for addressing some ehca issues we found during cluster test. [1/5] update qp_state on cached modify_qp() [2/5] rename goto label in ehca_poll_cq_one() [3/5] repoll on invalid opcode instead of returning success [4/5] check idr_find() return value [5/5] discard double CQE for one WR They all apply on top of 2.6.27-rc1. If possible, we would like to get them into 2.6.27. Regards, Alexander Schmidt ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
[PATCH v3] update xmon slb code.
[powerpc] update xmon slb code

This adds a bit more detail to the xmon SLB output. When the valid bit is set, this displays the ESID and VSID values, as well as decoding the segment size (1T or 256M) and displaying the LLP bits. This suppresses the output for any SLB entries that contain only zeros.

Sample output from power6 (1T segment support):
00 c800 40004f7ca3000500 1T ESID= c0 VSID= 4f7ca3 LLP:100
01 d800 4000eb71b400 1T ESID= d0 VSID= eb71b0 LLP: 0
08 1800 c8499f8ccc80 256M ESID=1 VSID=c8499f8cc LLP: 0
09 f800 d2c1a8e46c80 256M ESID=f VSID=d2c1a8e46 LLP: 0
10 4800 ca87eab1dc80 256M ESID=4 VSID=ca87eab1d LLP: 0
43 cf000800 400011b26500 1T ESID= cf VSID= 11b260 LLP:100

Sample output from power5 (notice the non-valid but non-zero entries):
10 0800 4fd0e077ac80 256M ESID=0 VSID=4fd0e077a LLP: 0
11 f800 5b085830fc80 256M ESID=f VSID=5b085830f LLP: 0
12 4800 52ce99fe6c80 256M ESID=4 VSID=52ce99fe6 LLP: 0
13 1800 50904ed95c80 256M ESID=1 VSID=50904ed95 LLP: 0
14 cf000800 d59aca40f500 256M ESID=cf000 VSID=d59aca40f LLP:100
15 c0007800 45cb97751500 256M ESID=c0007 VSID=45cb97751 LLP:100

Tested on power5 and power6.

Signed-Off-By: Will Schmidt [EMAIL PROTECTED]
---
This is a resend; this latest respin is updated to apply on top of Mikey's slb_mmu_size change. (Earlier updates were made per comments from Olof, Ben, and Paul.) This version adds padding around the ESID and VSID fields, and the LLP bits are displayed too. Counting bits, the VSID output looks to be as large as 51 bits, which requires up to 13 spaces. This doesn't count the B field bits, which are now masked off the top end of the VSID output. I'll try to follow up sometime later with code that will handle decoding page sizes. I don't have a testcase handy to properly exercise that yet.
:-)
---
 arch/powerpc/xmon/xmon.c | 29 +++++++++++++++++++++-------
 1 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 121b04d..5314db7 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2539,16 +2539,33 @@ static void xmon_print_symbol(unsigned long address, const char *mid,
 static void dump_slb(void)
 {
 	int i;
-	unsigned long tmp;
+	unsigned long esid, vsid, valid;
+	unsigned long llp;
 
 	printf("SLB contents of cpu %x\n", smp_processor_id());
 
 	for (i = 0; i < mmu_slb_size; i++) {
-		asm volatile("slbmfee  %0,%1" : "=r" (tmp) : "r" (i));
-		printf("%02d %016lx ", i, tmp);
-
-		asm volatile("slbmfev  %0,%1" : "=r" (tmp) : "r" (i));
-		printf("%016lx\n", tmp);
+		asm volatile("slbmfee  %0,%1" : "=r" (esid) : "r" (i));
+		asm volatile("slbmfev  %0,%1" : "=r" (vsid) : "r" (i));
+		valid = (esid & SLB_ESID_V);
+		if (valid | esid | vsid) {
+			printf("%02d %016lx %016lx", i, esid, vsid);
+			if (valid) {
+				llp = vsid & SLB_VSID_LLP;
+				if (vsid & SLB_VSID_B_1T) {
+					printf("  1T  ESID=%9lx  VSID=%13lx LLP:%3lx \n",
+					       GET_ESID_1T(esid),
+					       (vsid & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T,
+					       llp);
+				} else {
+					printf(" 256M ESID=%9lx  VSID=%13lx LLP:%3lx \n",
+					       GET_ESID(esid),
+					       (vsid & ~SLB_VSID_B) >> SLB_VSID_SHIFT,
+					       llp);
+				}
+			} else
+				printf("\n");
+		}
 	}
 }
Re: [PATCH] powerpc: fix os-term usage on kernel panic
On Fri, 2007-11-30 at 16:56 +1100, Stephen Rothwell wrote:

On Tue, 27 Nov 2007 18:15:59 -0600 Will Schmidt [EMAIL PROTECTED] wrote: (resending with the proper from addr this time). I'm seeing some funky behavior on power5/power6 partitions with this patch. A /sbin/reboot is now behaving much more like a /sbin/halt. Anybody else seeing this, or is it time for me to call an exorcist for my boxes?

On my Power5+ box, I get an error (code B200A101) logged every time I reboot, and about half the time the machine does not reboot but needs to be power cycled. Removing the cited patch makes the error not be logged, and reboots work fine. Paul and I have been having a look at this, and Paul has asked the architects for clarification of when os-term should be used.

The architects will have the correct answer, of course. From my reading of the PAPR, I've got the impression that there are two possibilities. First, os-term never returns, and it's up to the service processor to do whatever it's going to do (call home, dump, something else). Nothing we can do there. Second, os-term returns, and it's up to the OS to call into power-off or system-reboot. I think this is more likely. I haven't followed the path back from machine_restart to see exactly how we got there, but it probably means a bit more logic to decide whether to call into rtas_restart() or pSeries_power_off() after the call to machine_shutdown.
Re: [PATCH] powerpc: fix os-term usage on kernel panic
On Wed, 2007-11-28 at 14:18 -0600, Linas Vepstas wrote:

On Tue, Nov 27, 2007 at 06:15:59PM -0600, Will Schmidt wrote: (resending with the proper from addr this time). I'm seeing some funky behavior on power5/power6 partitions with this patch. A /sbin/reboot is now behaving much more like a /sbin/halt. Anybody else seeing this, or is it time for me to call an exorcist for my boxes?

I believe the patch http://www.nabble.com/-PATCH--powerpc-pseries:-tell-phyp-to-auto-restart-t4847604.html will cure this problem.

It does not. This code is getting called, but it still turns the box into a doorstop at /sbin/reboot. It does clear up if I apply this patch, which is a revert of part of your earlier patch. My JS2x also turns into a doorstop after /sbin/reboot. Though I'm not going through a panic path, I wonder if the panic portion is OK and this is what Olaf is hitting.

diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c
index fdeefe5..c9fac5a 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -508,7 +508,7 @@ define_machine(pseries) {
 	.power_off		= pSeries_power_off,
 	.halt			= rtas_halt,
 	.panic			= rtas_panic_msg,
-	.machine_shutdown	= rtas_os_term,
+/*	.machine_shutdown	= rtas_os_term, */
 	.get_boot_time		= rtas_get_boot_time,
 	.get_rtc_time		= rtas_get_rtc_time,
 	.set_rtc_time		= rtas_set_rtc_time,
pseries (power3) boot hang (pageblock_nr_pages==0)
Hi Folks,

I've been seeing a boot hang/crash on power3 systems for a few weeks (hangs on a 270, drops to the SP on a p610). This afternoon I got around to tracking it down to the changes in commit d9c2340052278d8eb2ffb16b0484f8f794def4de, "Do not depend on MAX_ORDER when grouping pages by mobility".

cpu 0x0: Vector: 100 (System Reset) at [c0006e803ae0]
 pc: c009bf50: .setup_per_zone_pages_min+0x298/0x34c
 lr: c009be38: .setup_per_zone_pages_min+0x180/0x34c
[c0006e803e20] c05e3898 .init_per_zone_pages_min+0x80/0xa0
[c0006e803ea0] c05c9c04 .kernel_init+0x214/0x3d8
[c0006e803f90] c0026cac .kernel_thread+0x4c/0x68

I narrowed it down to the for loop within setup_zone_migrate_reserve(), called by setup_per_zone_pages_min(). The loop spins forever because pageblock_nr_pages is 0. I imagine this would be properly fixed with something similar to the change for iSeries. Depending on how obvious, quick and easy it is for the experts to come up with a proper fix, I'll be able to do additional debug and hacking after turkey-day. :-)

For the moment, I've hacked it with the following patch (tested on both the 270 and the p610):

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2454,6 +2454,9 @@ static void setup_zone_migrate_reserve(struct zone *zone)
 	reserve = roundup(zone->pages_min, pageblock_nr_pages) >>
 		  pageblock_order;
 
+	/* this is a cheap and dirty bailout, probably not a proper fix. */
+	if (pageblock_nr_pages == 0) return;
+
 	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
 		if (!pfn_valid(pfn))
 			continue;
[PATCH] [powerpc v3] update xmon slb code
[powerpc] update xmon slb code

This adds a bit more detail to the xmon SLB output. When the valid bit is set, this displays the ESID and VSID values, as well as decoding the segment size (1T or 256M) and displaying the LLP bits. This suppresses the output for any SLB entries that contain only zeros.

Sample output from power6 (1T segment support):
00 c800 40004f7ca3000500 1T ESID= c0 VSID= 4f7ca3 LLP:100
01 d800 4000eb71b400 1T ESID= d0 VSID= eb71b0 LLP: 0
08 1800 c8499f8ccc80 256M ESID=1 VSID=c8499f8cc LLP: 0
09 f800 d2c1a8e46c80 256M ESID=f VSID=d2c1a8e46 LLP: 0
10 4800 ca87eab1dc80 256M ESID=4 VSID=ca87eab1d LLP: 0
43 cf000800 400011b26500 1T ESID= cf VSID= 11b260 LLP:100

Sample output from power5 (notice the non-valid but non-zero entries):
10 0800 4fd0e077ac80 256M ESID=0 VSID=4fd0e077a LLP: 0
11 f800 5b085830fc80 256M ESID=f VSID=5b085830f LLP: 0
12 4800 52ce99fe6c80 256M ESID=4 VSID=52ce99fe6 LLP: 0
13 1800 50904ed95c80 256M ESID=1 VSID=50904ed95 LLP: 0
14 cf000800 d59aca40f500 256M ESID=cf000 VSID=d59aca40f LLP:100
15 c0007800 45cb97751500 256M ESID=c0007 VSID=45cb97751 LLP:100

Tested on power5 and power6.

Signed-Off-By: Will Schmidt [EMAIL PROTECTED]
---
(Updates made per comments from Olof, Ben, and Paul.) This version adds padding around the ESID and VSID fields, and the LLP bits are displayed too. Counting bits, the VSID output looks to be as large as 51 bits, which requires up to 13 spaces. This doesn't count the B field bits, which are now masked off the top end of the VSID output. I'll try to follow up sometime later with code that will handle decoding page sizes. I don't have a testcase handy to properly exercise that yet.
:-)
---
 arch/powerpc/xmon/xmon.c | 29 +++++++++++++++++++++-------
 1 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 121b04d..5314db7 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2527,16 +2527,33 @@ static void xmon_print_symbol(unsigned long address, const char *mid,
 static void dump_slb(void)
 {
 	int i;
-	unsigned long tmp;
+	unsigned long esid, vsid, valid;
+	unsigned long llp;
 
 	printf("SLB contents of cpu %x\n", smp_processor_id());
 
 	for (i = 0; i < SLB_NUM_ENTRIES; i++) {
-		asm volatile("slbmfee  %0,%1" : "=r" (tmp) : "r" (i));
-		printf("%02d %016lx ", i, tmp);
-
-		asm volatile("slbmfev  %0,%1" : "=r" (tmp) : "r" (i));
-		printf("%016lx\n", tmp);
+		asm volatile("slbmfee  %0,%1" : "=r" (esid) : "r" (i));
+		asm volatile("slbmfev  %0,%1" : "=r" (vsid) : "r" (i));
+		valid = (esid & SLB_ESID_V);
+		if (valid | esid | vsid) {
+			printf("%02d %016lx %016lx", i, esid, vsid);
+			if (valid) {
+				llp = vsid & SLB_VSID_LLP;
+				if (vsid & SLB_VSID_B_1T) {
+					printf("  1T  ESID=%9lx  VSID=%13lx LLP:%3lx \n",
+					       GET_ESID_1T(esid),
+					       (vsid & ~SLB_VSID_B) >> SLB_VSID_SHIFT_1T,
+					       llp);
+				} else {
+					printf(" 256M ESID=%9lx  VSID=%13lx LLP:%3lx \n",
+					       GET_ESID(esid),
+					       (vsid & ~SLB_VSID_B) >> SLB_VSID_SHIFT,
+					       llp);
+				}
+			} else
+				printf("\n");
+		}
 	}
 }
[PATCH] [powerpc v2] update xmon slb code
[powerpc] update xmon slb code

This adds a bit more detail to the xmon SLB output. When the valid bit is set, this displays the ESID and VSID values, as well as decoding the segment size (1T or 256M). This suppresses the output for any SLB entries that contain only zeros.

Sample output from power6 (1T segment support):
00 c800 40004f7ca3000500 1T ESID= c0 VSID=40004f7ca3 LLP bits:100
01 d800 4000eb71b400 1T ESID= d0 VSID=4000eb71b0 LLP bits: 0
03 0800 628021c6ac80 256M ESID=0 VSID= 628021c6a LLP bits: 0
04 0f000800 400095c1e8000c80 1T ESID=f VSID=400095c1e8 LLP bits: 0
22 cf000800 400011b26500 1T ESID= cf VSID=400011b260 LLP bits:100
62 04000800 40005d488d000c80 1T ESID=4 VSID=40005d488d LLP bits: 0
63 1800 633f90285c80 256M ESID=1 VSID= 633f90285 LLP bits: 0

Sample output from power5 (notice the non-valid but non-zero entries):
00 c800 408f92c94500 256M ESID=c VSID= 408f92c94 LLP bits:100
01 d800 f09b89af5400 256M ESID=d VSID= f09b89af5 LLP bits: 0
03 1000 136eafb0bc80
11 0800 5928811f2c80 256M ESID=0 VSID= 5928811f2 LLP bits: 0
12 f800 645ff8d87c80 256M ESID=f VSID= 645ff8d87 LLP bits: 0
13 4800 5c263aa5ec80 256M ESID=4 VSID= 5c263aa5e LLP bits: 0
14 1800 59e7ef80dc80 256M ESID=1 VSID= 59e7ef80d LLP bits: 0
15 1000 59e7ef80dc80

Tested on power5 and power6.

Signed-Off-By: Will Schmidt [EMAIL PROTECTED]
---
This version adds padding around the ESID and VSID fields, and the LLP bits are displayed too (per request from Olof and Ben). I'll try to follow up sometime later with code that will handle decoding page sizes. I don't have a testcase handy to properly exercise that yet.
:-)
---
 arch/powerpc/xmon/xmon.c | 27 ++++++++++++++++++++-------
 1 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 121b04d..93c26c3 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2527,16 +2527,31 @@ static void xmon_print_symbol(unsigned long address, const char *mid,
 static void dump_slb(void)
 {
 	int i;
-	unsigned long tmp;
+	unsigned long esid, vsid, valid;
+	unsigned long llp_bits;
 
 	printf("SLB contents of cpu %x\n", smp_processor_id());
 
 	for (i = 0; i < SLB_NUM_ENTRIES; i++) {
-		asm volatile("slbmfee  %0,%1" : "=r" (tmp) : "r" (i));
-		printf("%02d %016lx ", i, tmp);
-
-		asm volatile("slbmfev  %0,%1" : "=r" (tmp) : "r" (i));
-		printf("%016lx\n", tmp);
+		asm volatile("slbmfee  %0,%1" : "=r" (esid) : "r" (i));
+		asm volatile("slbmfev  %0,%1" : "=r" (vsid) : "r" (i));
+		valid = (esid & SLB_ESID_V);
+		if (valid | esid | vsid) {
+			printf("%02d %016lx %016lx", i, esid, vsid);
+			if (valid) {
+				llp_bits = vsid & SLB_VSID_LLP;
+				if (vsid & SLB_VSID_B_1T) {
+					printf("  1T  ESID=%9lx  VSID=%10lx LLP bits:%3lx \n",
+					       GET_ESID_1T(esid), vsid >> SLB_VSID_SHIFT_1T,
+					       llp_bits);
+				} else {
+					printf(" 256M ESID=%9lx  VSID=%10lx LLP bits:%3lx \n",
+					       GET_ESID(esid), vsid >> SLB_VSID_SHIFT,
+					       llp_bits);
+				}
+			} else
+				printf("\n");
+		}
 	}
 }
[PATCH] [Powerpc V2] fix switch_slb handling of 1T ESID values
[Powerpc V2] fix switch_slb handling of 1T ESID values

Now that we have 1TB segment size support, we need to be using the GET_ESID_1T macro when comparing ESID values for pc, stack, and unmapped_base within switch_slb(). A new helper function called esids_match() contains the logic for deciding when to call GET_ESID and GET_ESID_1T. This also happens to fix a duplicate-slb-entry-inspired machine-check exception I was seeing when trying to run Java on a power6 partition.

Tested on power6 and power5.

Signed-Off-By: Will Schmidt [EMAIL PROTECTED]
---
 arch/powerpc/mm/slb.c | 33 ++++++++++++++++++++++++++++++---
 1 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index bbd2c51..152f4cd 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -148,6 +148,34 @@ void slb_vmalloc_update(void)
 	slb_flush_and_rebolt();
 }
 
+/* Helper function to compare esids. There are four cases to handle.
+ * 1. The system is not 1T segment size capable. Use the GET_ESID compare.
+ * 2. The system is 1T capable, both addresses are < 1T. Use the GET_ESID compare.
+ * 3. The system is 1T capable, only one of the two addresses is < 1T. This is not a match.
+ * 4. The system is 1T capable, both addresses are >= 1T. Use the GET_ESID_1T macro to compare.
+ */
+static inline int esids_match(unsigned long addr1, unsigned long addr2)
+{
+	int esid_1t_count;
+
+	/* System is not 1T segment size capable. */
+	if (!cpu_has_feature(CPU_FTR_1T_SEGMENT))
+		return (GET_ESID(addr1) == GET_ESID(addr2));
+
+	esid_1t_count = (((addr1 >> SID_SHIFT_1T) != 0) +
+			 ((addr2 >> SID_SHIFT_1T) != 0));
+
+	/* both addresses are < 1T */
+	if (esid_1t_count == 0)
+		return (GET_ESID(addr1) == GET_ESID(addr2));
+
+	/* One address < 1T, the other >= 1T. Not a match */
+	if (esid_1t_count == 1)
+		return 0;
+
+	/* Both addresses are >= 1T. */
+	return (GET_ESID_1T(addr1) == GET_ESID_1T(addr2));
+}
+
 /* Flush all user entries from the segment table of the current processor. */
 void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 {
@@ -193,15 +221,14 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 		return;
 	slb_allocate(pc);
 
-	if (GET_ESID(pc) == GET_ESID(stack))
+	if (esids_match(pc, stack))
 		return;
 
 	if (is_kernel_addr(stack))
 		return;
 	slb_allocate(stack);
 
-	if ((GET_ESID(pc) == GET_ESID(unmapped_base))
-	    || (GET_ESID(stack) == GET_ESID(unmapped_base)))
+	if (esids_match(pc, unmapped_base) || esids_match(stack, unmapped_base))
 		return;
 
 	if (is_kernel_addr(unmapped_base))
[PATCH] [Powerpc] include udbg.h when using udbg_printf
[Powerpc] include udbg.h when using udbg_printf

This fixes the error:
	error: implicit declaration of function 'udbg_printf'

We have a few spots where we reference udbg_printf() without #including udbg.h. These are within #ifdef DEBUG blocks, so they go unnoticed until we do a #define DEBUG or #define DEBUG_LOW nearby.

Signed-off-by: Will Schmidt [EMAIL PROTECTED]
---
 arch/powerpc/mm/hash_utils_64.c           | 1 +
 arch/powerpc/mm/slb.c                     | 1 +
 arch/powerpc/platforms/cell/smp.c         | 1 +
 arch/powerpc/platforms/pseries/firmware.c | 1 +
 4 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c
index c78dc91..94b8ca0 100644
--- a/arch/powerpc/mm/hash_utils_64.c
+++ b/arch/powerpc/mm/hash_utils_64.c
@@ -51,6 +51,7 @@
 #include <asm/cputable.h>
 #include <asm/sections.h>
 #include <asm/spu.h>
+#include <asm/udbg.h>

 #ifdef DEBUG
 #define DBG(fmt...) udbg_printf(fmt)
diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index bbd2c51..637afb2 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -25,6 +25,7 @@
 #include <asm/smp.h>
 #include <asm/firmware.h>
 #include <linux/compiler.h>
+#include <asm/udbg.h>

 #ifdef DEBUG
 #define DBG(fmt...) udbg_printf(fmt)
diff --git a/arch/powerpc/platforms/cell/smp.c b/arch/powerpc/platforms/cell/smp.c
index 1c0acba..e443845 100644
--- a/arch/powerpc/platforms/cell/smp.c
+++ b/arch/powerpc/platforms/cell/smp.c
@@ -44,6 +44,7 @@
 #include <asm/rtas.h>

 #include "interrupt.h"
+#include <asm/udbg.h>

 #ifdef DEBUG
 #define DBG(fmt...) udbg_printf(fmt)
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 8b18a1c..b765b7c 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -25,6 +25,7 @@
 #include <asm/firmware.h>
 #include <asm/prom.h>
+#include <asm/udbg.h>

 #ifdef DEBUG
 #define DBG(fmt...) udbg_printf(fmt)
[PATCH] [powerpc] update xmon slb code
[powerpc] update xmon slb code

This adds a bit more detail to the xmon SLB output. When the valid bit is set, it displays the ESID and VSID values, as well as decoding the segment size (1T or 256M). It suppresses the output for any SLB entries that contain only zeros.

I debated a bit on whether to check for just (valid) versus checking for (valid|esid|vsid). By inspection on power5, I do have SLB entries that contain values but without the valid bit set, so I opted to display any non-zero values.

Sample output from power6 (1T segment support):
00 c800 40004f7ca3000500 1T ESID=c0 VSID=40004f7ca3
01 d800 4000eb71b400 1T ESID=d0 VSID=4000eb71b0
24 cf000800 400011b26500 1T ESID=cf VSID=400011b260
25 04000800 4000a9e949000c80 1T ESID=4 VSID=4000a9e949
26 1800 5e93bfd49c80 256M ESID=1 VSID=5e93bfd49
27 0f000800 4000e262a4000c80 1T ESID=f VSID=4000e262a4
28 0800 5dd45172ec80 256M ESID=0 VSID=5dd45172e

Sample output from power5 (notice the non-valid but non-zero entries):
54 4800 cf33bb059c80 256M ESID=4 VSID=cf33bb059
55 1800 ccf56fe08c80 256M ESID=1 VSID=ccf56fe08
56 1000 dd82ce799c80
57 cf000800 d59aca40f500 256M ESID=cf000 VSID=d59aca40f
58 c0007800 45cb97751500 256M ESID=c0007 VSID=45cb97751
59 0400 61552db1bc80

Tested on power5 and power6.
Signed-off-by: Will Schmidt [EMAIL PROTECTED]
---
 arch/powerpc/xmon/xmon.c | 20 ++--
 1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 121b04d..97984f3 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2527,16 +2527,24 @@ static void xmon_print_symbol(unsigned long address, const char *mid,
 static void dump_slb(void)
 {
 	int i;
-	unsigned long tmp;
+	unsigned long esid, vsid, valid;

 	printf("SLB contents of cpu %x\n", smp_processor_id());

 	for (i = 0; i < SLB_NUM_ENTRIES; i++) {
-		asm volatile("slbmfee  %0,%1" : "=r" (tmp) : "r" (i));
-		printf("%02d %016lx ", i, tmp);
-
-		asm volatile("slbmfev  %0,%1" : "=r" (tmp) : "r" (i));
-		printf("%016lx\n", tmp);
+		asm volatile("slbmfee  %0,%1" : "=r" (esid) : "r" (i));
+		asm volatile("slbmfev  %0,%1" : "=r" (vsid) : "r" (i));
+		valid = (esid & SLB_ESID_V);
+		if (valid | esid | vsid) {
+			printf("%02d %016lx %016lx", i, esid, vsid);
+			if (valid) {
+				if (vsid & SLB_VSID_B_1T)
+					printf(" 1T ESID=%lx VSID=%lx \n",
+					       GET_ESID_1T(esid), vsid >> SLB_VSID_SHIFT_1T);
+				else
+					printf(" 256M ESID=%lx VSID=%lx \n",
+					       GET_ESID(esid), vsid >> SLB_VSID_SHIFT);
+			} else
+				printf("\n");
+		}
 	}
 }
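The decode step in dump_slb() can be modeled in plain C. This is a sketch with synthetic register values, not xmon itself; the constants below are assumed to match the 64-bit hash MMU headers (SLB_ESID_V is the valid bit in the slbmfee result, SLB_VSID_B_1T is the B field in the slbmfev result).

```c
#include <assert.h>

/* Constants assumed from the 64-bit hash MMU headers, for a standalone
 * sketch of xmon's dump_slb() decode step. */
#define SLB_ESID_V        0x0000000008000000UL	/* entry is valid */
#define SLB_VSID_B_1T     0x4000000000000000UL	/* B field: 1TB segment */
#define SLB_VSID_SHIFT    12
#define SLB_VSID_SHIFT_1T 24
#define GET_ESID(e)       ((e) >> 28)
#define GET_ESID_1T(e)    ((e) >> 40)

struct slb_decode {
	int valid;
	int is_1t;
	unsigned long esid;
	unsigned long vsid;
};

/* Decode one slbmfee/slbmfev register pair the way dump_slb() prints it:
 * check the valid bit, then pick the 1T or 256M shift from the B field. */
static struct slb_decode decode_slb_entry(unsigned long esid_reg,
					  unsigned long vsid_reg)
{
	struct slb_decode d = { 0 };

	d.valid = (esid_reg & SLB_ESID_V) != 0;
	if (!d.valid)
		return d;
	d.is_1t = (vsid_reg & SLB_VSID_B_1T) != 0;
	if (d.is_1t) {
		d.esid = GET_ESID_1T(esid_reg);
		d.vsid = vsid_reg >> SLB_VSID_SHIFT_1T;
	} else {
		d.esid = GET_ESID(esid_reg);
		d.vsid = vsid_reg >> SLB_VSID_SHIFT;
	}
	return d;
}
```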
Re: [PATCH] [Powerpc] fix switch_slb handling of 1T ESID values
On Sat, 2007-10-27 at 14:19 +1000, Benjamin Herrenschmidt wrote:
On Fri, 2007-10-26 at 15:46 -0500, Will Schmidt wrote:
[Powerpc] fix switch_slb handling of 1T ESID values. Now that we have 1TB segment size support, we need to be using the GET_ESID_1T macro when comparing ESID values for pc, stack, and unmapped_base within switch_slb() when we're on a CPU that supports it. This also happens to fix a duplicate-slb-entry inspired machine-check exception I was seeing when trying to run java on a power6 partition. Tested on power6 and power5. Signed-off-by: Will Schmidt [EMAIL PROTECTED]

Good catch! A minor comment is maybe you could factor out the code better, doing something like an ESID_COMPARE() macro?

Yeah, that's a good idea. I'll spin up a new patch in the next day or so. It occurred to me that I should continue to use GET_ESID when the user address is < 1T too.

--- There is a similar bit of code in stab.c switch_stab(). Should this change also be made there? ---

There is no machine that does stab and 1T segments.

Ok, thanks for the clarification. -Will

Ben.
 arch/powerpc/mm/slb.c | 19 ++-
 1 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index bbd2c51..0c527d7 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -193,16 +193,25 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 		return;
 	slb_allocate(pc);

-	if (GET_ESID(pc) == GET_ESID(stack))
-		return;
+	if (cpu_has_feature(CPU_FTR_1T_SEGMENT)) {
+		if (GET_ESID_1T(pc) == GET_ESID_1T(stack))
+			return;
+	} else
+		if (GET_ESID(pc) == GET_ESID(stack))
+			return;

 	if (is_kernel_addr(stack))
 		return;
 	slb_allocate(stack);

-	if ((GET_ESID(pc) == GET_ESID(unmapped_base))
-	    || (GET_ESID(stack) == GET_ESID(unmapped_base)))
-		return;
+	if (cpu_has_feature(CPU_FTR_1T_SEGMENT)) {
+		if ((GET_ESID_1T(pc) == GET_ESID_1T(unmapped_base))
+		    || (GET_ESID_1T(stack) == GET_ESID_1T(unmapped_base)))
+			return;
+	} else
+		if ((GET_ESID(pc) == GET_ESID(unmapped_base))
+		    || (GET_ESID(stack) == GET_ESID(unmapped_base)))
+			return;

 	if (is_kernel_addr(unmapped_base))
 		return;
[PATCH] [Powerpc] fix switch_slb handling of 1T ESID values
[Powerpc] fix switch_slb handling of 1T ESID values

Now that we have 1TB segment size support, we need to be using the GET_ESID_1T macro when comparing ESID values for pc, stack, and unmapped_base within switch_slb() when we're on a CPU that supports it.

This also happens to fix a duplicate-slb-entry inspired machine-check exception I was seeing when trying to run java on a power6 partition.

Tested on power6 and power5.

Signed-off-by: Will Schmidt [EMAIL PROTECTED]
---
There is a similar bit of code in stab.c switch_stab(). Should this change also be made there?
---
 arch/powerpc/mm/slb.c | 19 ++-
 1 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/mm/slb.c b/arch/powerpc/mm/slb.c
index bbd2c51..0c527d7 100644
--- a/arch/powerpc/mm/slb.c
+++ b/arch/powerpc/mm/slb.c
@@ -193,16 +193,25 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 		return;
 	slb_allocate(pc);

-	if (GET_ESID(pc) == GET_ESID(stack))
-		return;
+	if (cpu_has_feature(CPU_FTR_1T_SEGMENT)) {
+		if (GET_ESID_1T(pc) == GET_ESID_1T(stack))
+			return;
+	} else
+		if (GET_ESID(pc) == GET_ESID(stack))
+			return;

 	if (is_kernel_addr(stack))
 		return;
 	slb_allocate(stack);

-	if ((GET_ESID(pc) == GET_ESID(unmapped_base))
-	    || (GET_ESID(stack) == GET_ESID(unmapped_base)))
-		return;
+	if (cpu_has_feature(CPU_FTR_1T_SEGMENT)) {
+		if ((GET_ESID_1T(pc) == GET_ESID_1T(unmapped_base))
+		    || (GET_ESID_1T(stack) == GET_ESID_1T(unmapped_base)))
+			return;
+	} else
+		if ((GET_ESID(pc) == GET_ESID(unmapped_base))
+		    || (GET_ESID(stack) == GET_ESID(unmapped_base)))
+			return;

 	if (is_kernel_addr(unmapped_base))
 		return;
Re: [bug] block subsystem related crash on Legacy iSeries viodasd.c
On Sun, 2007-10-21 at 14:44 +0200, Jens Axboe wrote:
On Fri, Oct 19 2007, Will Schmidt wrote: Hi Jens, Stephen, and everyone else. ...

You need this, will remember to fix that up for the new branch as well.

diff --git a/drivers/block/viodasd.c b/drivers/block/viodasd.c
index e824b67..2ce3622 100644
--- a/drivers/block/viodasd.c
+++ b/drivers/block/viodasd.c
@@ -270,6 +270,7 @@ static int send_request(struct request *req)
 	d = req->rq_disk->private_data;

 	/* Now build the scatter-gather list */
+	memset(sg, 0, sizeof(sg));
 	nsg = blk_rq_map_sg(req->q, req, sg);
 	nsg = dma_map_sg(d->dev, sg, nsg, direction);

That appears to do the trick. Thanks!

Tested-by: Will Schmidt [EMAIL PROTECTED]
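The bug class behind this one-line fix — reusing an on-stack table that the mapping function only partially fills — can be shown with a toy model. Everything below is hypothetical (map_request and sg_entry are stand-ins, not the block-layer API); the point is only that entries past the mapped count keep stale contents unless the caller zeroes the table first, which is what the memset above does for the real scatterlist.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for a scatterlist entry (hypothetical, not the kernel type). */
struct sg_entry {
	unsigned long addr;
	unsigned int len;
};

/* Toy stand-in for blk_rq_map_sg(): fills in only the first nsegs
 * entries of the caller-provided table and returns that count.
 * Entries beyond nsegs are deliberately left untouched. */
static int map_request(struct sg_entry *sg, int nsegs)
{
	for (int i = 0; i < nsegs; i++) {
		sg[i].addr = 0x1000UL * (i + 1);
		sg[i].len = 4096;
	}
	return nsegs;
}
```

Without the memset, whatever garbage was on the stack in sg[2] and sg[3] would be handed to the DMA-mapping step alongside the two real segments.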
Re: [PATCH] Use 1TB segments
On Wed, 2007-10-03 at 13:13 +1000, Paul Mackerras wrote:

Will Schmidt writes: I still need to test this code for performance issues, and this version could still use some cosmetic touchups, so I don't think we want this to go into a tree yet. I am reposting this primarily to indicate the prior version isn't quite right, and so Jon can rebase to this version. :-) The way we scan the ibm,processor-segment-sizes property could be nicer.

Were there any other cosmetic touchups you were thinking of, and if so, what were they? I didn't notice any leftover debugging printks or anything else that obviously needed cleaning up.

Correct.. nothing in the patch really *needs* to be cleaned up. This is mostly me being way more nit-picky than I need to be. :-) I don't have any real issues with the patch (being candidate for) going into a tree. The only obvious one is the MMU_SEGSIZE_* #defines in mmu-hash64.h, which appear to be duplicated. The rest I can follow up on later; none of it affects the code outside of #ifdef DEBUG's, and it should be a separate patch anyway.

Thanks, -Will

Paul.
Re: [PATCH] Use 1TB segments
Hi Paul, just a few questions.

On Wed, 2007-08-01 at 12:04 +1000, Paul Mackerras wrote:

This makes the kernel use 1TB segments for all kernel mappings and for user addresses of 1TB and above, on machines which support them (currently POWER5+ and POWER6). We don't currently use 1TB segments for user addresses < 1T, since that would effectively prevent 32-bit processes from using huge pages unless we also had a way to revert to using 256MB segments.

I think I have a question about user addresses < 1T.. once I think on it a bit more it'll either click, or I'll have a question articulated. :-)

-static inline void __tlbiel(unsigned long va, unsigned int psize)
+static inline void __tlbiel(unsigned long va, int psize, int ssize)

-static inline void tlbie(unsigned long va, int psize, int local)
+static inline void tlbie(unsigned long va, int psize, int ssize, int local)

 static long native_hpte_insert(unsigned long hpte_group, unsigned long va,
 			unsigned long pa, unsigned long rflags,
-			unsigned long vflags, int psize)
+			unsigned long vflags, int psize, int ssize)

 static long native_hpte_updatepp(unsigned long slot, unsigned long newpp,
-			unsigned long va, int psize, int local)
+			unsigned long va, int psize, int ssize,
+			int local)

Is there a technical reason why the 'local' variable remains at the end of the parameter list for these? In other cases 'ssize' simply gets added to the end of the parameter list.

+static int __init htab_dt_scan_seg_sizes(unsigned long node,
+					 const char *uname, int depth,
+					 void *data)
+{
+	char *type = of_get_flat_dt_prop(node, "device_type", NULL);
+	u32 *prop;
+	unsigned long size = 0;
+
+	/* We are scanning "cpu" nodes only */
+	if (type == NULL || strcmp(type, "cpu") != 0)
+		return 0;
+
+	prop = (u32 *)of_get_flat_dt_prop(node, "ibm,processor-segment-sizes",
+					  &size);
+	if (prop != NULL && size >= 8) {
+		if (prop[0] == 0x1c && prop[1] == 0x28) {

This is 0x1c indicating 2^28 for 256M, and 0x28 indicating 2^40 for 1TB segments.
Will there ever be a segment size between the two? Or will the representation ever vary from this? i.e., wondering if prop[0] will always be for 256M and prop[1] for 1TB.

+#define slb_vsid_shift(ssize)	\
+	((ssize) == MMU_SEGSIZE_256M ? SLB_VSID_SHIFT : SLB_VSID_SHIFT_1T)

@@ -100,12 +106,13 @@ void slb_flush_and_rebolt(void)
 	vflags = SLB_VSID_KERNEL | vmalloc_llp;

 	ksp_esid_data = mk_esid_data(get_paca()->kstack, 2);
-	if ((ksp_esid_data & ESID_MASK) == PAGE_OFFSET)
+	mask = (mmu_kernel_ssize == MMU_SEGSIZE_256M) ? ESID_MASK : ESID_MASK_1T;

Is this one worthy of a #define like the slb_vsid_shift() above?

+#define VSID_MULTIPLIER_256M	ASM_CONST(200730139)	/* 28-bit prime */
+#define VSID_MULTIPLIER_1T	ASM_CONST(12538073)	/* 24-bit prime */

Anything special in how this 24-bit prime value was selected? (The same question could be asked for the 28-bit prime, though I see that value was updated at least once a few years back.)

-Will
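The ibm,processor-segment-sizes encoding discussed above is easy to model: each cell of the property holds log2 of a supported segment size in bytes, so 0x1c means 2^28 (256MB) and 0x28 means 2^40 (1TB). The sketch below mirrors the check in htab_dt_scan_seg_sizes(); it is a standalone model, not the flattened-device-tree API.

```c
#include <assert.h>
#include <stddef.h>

/* Each property cell holds log2 of a segment size in bytes:
 * 0x1c -> 2^28 (256MB), 0x28 -> 2^40 (1TB). */
static unsigned long segment_size_bytes(unsigned int encoded)
{
	return 1UL << encoded;
}

/* Mirror of the check in htab_dt_scan_seg_sizes() quoted above: 1TB
 * segments are enabled only when the property holds at least two cells
 * (size is in bytes, 4 per cell) advertising both 256MB and 1TB. */
static int supports_1t_segments(const unsigned int *prop, unsigned long size)
{
	return prop != NULL && size >= 8 && prop[0] == 0x1c && prop[1] == 0x28;
}
```

This makes concrete why the quoted code insists on `size >= 8`: a property advertising only 256MB segments is 4 bytes long and must not be read past its end.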
Re: [PATCH] Use 1TB segments
On Fri, 2007-08-03 at 08:37 +1000, Benjamin Herrenschmidt wrote:

Is there a technical reason why the 'local' variable remains at the end of the parameter list for these? In other cases 'ssize' simply gets added to the end of the parameter list.

Looks nicer to have psize and ssize together :-)

Aah! And here I thought there was some obscure register usage optimization going on.. :-)

-Will

Ben.
Re: [PATCH] fixes for the SLB shadow buffer
On Wed, 2007-08-01 at 16:02 +1000, Michael Neuling wrote:

--- linux-2.6-ozlabs.orig/arch/powerpc/kernel/entry_64.S
+++ linux-2.6-ozlabs/arch/powerpc/kernel/entry_64.S
@@ -389,7 +389,9 @@ BEGIN_FTR_SECTION
 	ld	r9,PACA_SLBSHADOWPTR(r13)
 	li	r12,0
 	std	r12,SLBSHADOW_STACKESID(r9)	/* Clear ESID */
+	eieio
 	std	r7,SLBSHADOW_STACKVSID(r9)	/* Save VSID */
+	eieio
 	std	r0,SLBSHADOW_STACKESID(r9)	/* Save ESID */

Hi Michael, I was going to ask if we really needed both of them, but I think I convinced myself that we do. Do we also want/need an eieio after the /* Save ESID */ statement, or is that somehow handled by the slbie following?

-Will
Re: [PATCH] [POWERPC] iSeries: fix section mismatch warnings
On Thu, 2007-07-26 at 11:56 +1000, Stephen Rothwell wrote:

Hi Will,

On Wed, 25 Jul 2007 11:55:31 -0500 Will Schmidt [EMAIL PROTECTED] wrote:

 	cmpwi	0,r24,0		/* Are we processor 0? */
-	beq	.__start_initialization_iSeries	/* Start up the first processor */
-	mfspr	r4,SPRN_CTRLF
+	bne	1f
+	b	.__start_initialization_iSeries	/* Start up the first processor */
+1:	mfspr	r4,SPRN_CTRLF
 	li	r5,CTRL_RUNLATCH	/* Turn off the run light */

This part isn't clicking for me.. How does changing a beq to a bne over a b fit into making __start_initialization_iSeries static?

Because I moved __start_initialization_iSeries into another section, it ends up too far away for a conditional branch, so something adds a jump table to the .text section and changes this branch to be via that table. Unfortunately, the jump table ends up at the start of the .text and ruins our carefully laid-out kernel image. By inverting the test I can turn the branch into an unconditional one, which has a larger possible offset (effectively building the jump table manually).

Gotcha, thanks for the clarification. :-) -Will