from:"Benjamin Herrenschmidt"

Re: [PATCH v4 5/5] drm/ofdrm: Support big-endian scanout buffers

2022-10-17 Thread Benjamin Herrenschmidt

On Thu, 2022-10-13 at 09:39 +0200, Javier Martinez Canillas wrote:
> > In absence of such test results I think the most reasonable thing is to
> > keep the logic that nobody complained about for 10+ years.
> > 
> 
> I agree with Michal and Thomas on this. I don't see a strong reason to not
> use the same heuristic that the offb fbdev driver has been doing for this.
> 
> Otherwise if this turns out to be needed, it will cause a regression for a
> user that switches to this driver instead. Specially since both fbdev and
> DRM drivers match against the same "display" OF compatible string.

I agree.

In the end, what it boils down to is, we don't know, we should guess. The
endianness of the kernel is as good a guess as anything here.

If not that, then assume BE.

That code was originally written for old macs. Those could simply not boot
anything other than a BE kernel. OF would always setup a 8bpp fb (so endianness
is irrelevant) but BootX could setup something else.

There's almost no old LE powerpc system out there, and I'm reasonably sure
actually none of any relevance to this code.

That leaves us with newer systems capable of endian switching. Those should just
get the property added.

I don't know of many cases out there. There' SLOS on powerpc which still won't
set it (which is what qemu uses). That could just be fixed. In the meantime
it makes sense for the kernel to follow its existing behaviour.

Cheers,
Ben.

Re: [PATCH v2 2/2] powerpc/pci: Prefer PCI domain assignment via DT 'linux,pci-domain' and alias

2022-09-22 Thread Benjamin Herrenschmidt

On Mon, 2022-08-15 at 15:46 +1000, Michael Ellerman wrote:
> Guenter Roeck  writes:
> > On Wed, Jul 06, 2022 at 12:21:48PM +0200, Pali Rohár wrote:
> > > Other Linux architectures use DT property 'linux,pci-domain' for 
> > > specifying
> > > fixed PCI domain of PCI controller specified in Device-Tree.
> > > 
> > > And lot of Freescale powerpc boards have defined numbered pci alias in
> > > Device-Tree for every PCIe controller which number specify preferred PCI
> > > domain.
> > > 
> > > So prefer usage of DT property 'linux,pci-domain' (via function
> > > of_get_pci_domain_nr()) and DT pci alias (via function of_alias_get_id())
> > > on powerpc architecture for assigning PCI domain to PCI controller.
> > > 
> > > Fixes: 63a72284b159 ("powerpc/pci: Assign fixed PHB number based on 
> > > device-tree properties")
> > > Signed-off-by: Pali Rohár 
> > 
> > This patch results in a number of boot warnings with various qemu
> > boot tests.
> 
> Thanks for the report.
> 
> I have automated qemu boot tests to catch things like this, they even
> have DEBUG_ATOMIC_SLEEP enabled ... but I inadvertantly broke my script
> that checks for "BUG:" in the console log. Sometimes you just can't
> win.

So the problem is

spin_lock(_spinlock);

get_phb_number() relies on it for the phb_bitmap allocation. You can
move it out of the lock but you'll have to either:

 - Take the lock inside it to protect the allocation

 - Turn find_first_zero_bit/set_bit into a loop of
find_first_zero_bit+test_and_set_bit() which wouldn't require a lock.

Note about the other "reg" numbering conversation ... I'm pretty sure
that breaks some old PowerMac crap which shows nobody really uses these
things considering how long the patch has been there :-)

I'm pretty sure something somewhere assumes the primary bus is 0. Some
old userspace definitely does though that might no longer be relevant.
The whole business with "domain 0" being special and avoiding domain
numbers in /proc relies on this too...

Cheers,
Ben.

Re: [PATCH] powerpc/pci: Enable PCI domains in /proc when PCI bus numbers are not unique

2022-09-22 Thread Benjamin Herrenschmidt

On Thu, 2022-09-01 at 13:53 +1000, Michael Ellerman wrote:
> > 
> > I sent two patches which do another steps to achieve it:
> > https://lore.kernel.org/linuxppc-dev/20220817163927.24453-1-p...@kernel.org/t/#u
> > 
> > Main blocker is pci-OF-bus-map which is in direct conflict with
> > CONFIG_PPC_PCI_BUS_NUM_DOMAIN_DEPENDENT and which used on chrp and pmac.
> > And I have no idea if pci-OF-bus-map is still needed or not.
> 
> Yeah thanks, I saw those patches.
> 
> I can't find any code that refers to pci-OF-bus-map, so I'm inclined to
> remove it entirely.
> 
> But I'll do some more searching to see if I can find any references to
> it in old code.

Trying to remember ... :-)

So this is what I recall at this point:

 - Ancient X11 didn't understand domains in /proc and thus would barf,
which was the primary reason for not enabling them always iirc...

 - There might be something else with early PowerMacs (Grand Central
chipset) where we have effectively two domains (gc and chaos) but
overlapping bus numbers. There might still be pre-historical code in
there that assumes it's that way though I can't see anything obvious.
Paul might still have one of these :-) (PowerMac 7200/7500/8500/9500
afaik).

 - pci-OF-bus-map predates the PCI layer keeping track of the PCI/OF
relationship. I don't believe it's still used anywhere in the kernel,
though it's possible (unlikely ?) that some garbage remains in
userspace that does.

At this point, I wouldn't object to tearing this all out and just
having domains always (and see what the fallout is).

Cheers,
Ben.

Re: [PATCH v2 10/10] drm/ofdrm: Support color management

2022-08-04 Thread Benjamin Herrenschmidt

On Wed, 2022-07-27 at 10:41 +0200, Thomas Zimmermann wrote:
> 
> > > +static void __iomem *ofdrm_mach64_cmap_ioremap(struct ofdrm_device *odev,
> > > +struct device_node *of_node,
> > > +u64 fb_base)
> > > +{
> > > + struct drm_device *dev = >dev;
> > > + u64 address;
> > > + void __iomem *cmap_base;
> > > +
> > > + address = fb_base & 0xff00ul;
> > > + address += 0x7ff000;
> > > +
> > 
> > It would be good to know where these addresses are coming from. Maybe some
> > constant macros or a comment ? Same for the other places where addresses
> > and offsets are used.
> 
> I have no idea where these values come from. I took them from offb. And 
> I suspect that some of these CMAP helpers could be further merged if 
> only it was clear where the numbers come from.  But as i don't have the 
> equipment for testing, I took most of this literally as-is from offb.

Ancient black magic :-) Old ATI mach64 chips had the registers sitting
at the end of the framebuffer. You can find an equivalent in
drivers/video/aty/atyfb_base.c:atyfb_setup_generic():

raddr = addr + 0x7ff000UL;

Cheers,
Ben.

Re: [PATCH v2 09/10] drm/ofdrm: Add per-model device function

2022-08-04 Thread Benjamin Herrenschmidt

On Tue, 2022-07-26 at 16:40 +0200, Michal Suchánek wrote:
> Hello,
> 
> On Tue, Jul 26, 2022 at 03:38:37PM +0200, Javier Martinez Canillas wrote:
> > On 7/20/22 16:27, Thomas Zimmermann wrote:
> > > Add a per-model device-function structure in preparation of adding
> > > color-management support. Detection of the individual models has been
> > > taken from fbdev's offb.
> > > 
> > > Signed-off-by: Thomas Zimmermann 
> > > ---
> > 
> > Reviewed-by: Javier Martinez Canillas 
> > 
> > [...]
> > 
> > > +static bool is_avivo(__be32 vendor, __be32 device)
> > > +{
> > > + /* This will match most R5xx */
> > > + return (vendor == 0x1002) &&
> > > +((device >= 0x7100 && device < 0x7800) || (device >= 0x9400));
> > > +}
> > 
> > Maybe add some constant macros to not have these magic numbers ?
> 
> This is based on the existing fbdev implementation's magic numbers:
> 
> drivers/video/fbdev/offb.c: ((*did >= 0x7100 && *did < 
> 0x7800) ||
> 
> Of course, it would be great if somebody knowledgeable could clarify
> those.

I don't think anybody remembers :-) Vendor 0x1002 is PCI_VENDOR_ID_ATI,
but the rest is basically ranges of PCI IDs for which we don't have
symbolic constants.

Cheers,
Ben.

Re: [PATCH v2 10/10] drm/ofdrm: Support color management

2022-08-04 Thread Benjamin Herrenschmidt

On Wed, 2022-07-20 at 16:27 +0200, Thomas Zimmermann wrote:
> +#if !defined(CONFIG_PPC)
> +static inline void out_8(void __iomem *addr, int val)
> +{ }
> +static inline void out_le32(void __iomem *addr, int val)
> +{ }
> +static inline unsigned int in_le32(const void __iomem *addr)
> +{
> +   return 0;
> +}
> +#endif

These guys could just be replaced with readb/writel/readl respectively
(beware of the argument swap).

Cheers,
Ben.

Re: [PATCH] powerpc/powermac/pfunc_base: Fix refcount leak bug in macio_gpio_init_one()

2022-08-01 Thread Benjamin Herrenschmidt

On Sat, 2022-07-16 at 15:31 +0800, Liang He wrote:
> We should call of_node_put() for the reference 'gparent' escaped
> out of the for_each_child_of_node() as it has increased the refcount.

Same comment as before. That stuff happens once at boot, there's never
any dynamic allocation/deallocation of these, they just don't matter,
but feel free  :-)

> Fixes: 5b9ca526917b ("[PATCH] 3/5 powerpc: Add platform functions
> interpreter")
> Signed-off-by: Liang He 
> ---
>  arch/powerpc/platforms/powermac/pfunc_base.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powermac/pfunc_base.c
> b/arch/powerpc/platforms/powermac/pfunc_base.c
> index 9c2947a3edd5..085e0ad20eba 100644
> --- a/arch/powerpc/platforms/powermac/pfunc_base.c
> +++ b/arch/powerpc/platforms/powermac/pfunc_base.c
> @@ -136,6 +136,8 @@ static void __init macio_gpio_init_one(struct
> macio_chip *macio)
>   for_each_child_of_node(gparent, gp)
>   pmf_do_functions(gp, NULL, 0, PMF_FLAGS_ON_INIT, NULL);
>  
> + of_node_put(gparent);
> +
>   /* Note: We do not at this point implement the "at sleep" or
> "at wake"
>* functions. I yet to find any for GPIOs anyway
>*/

Re: [PATCH] powerpc/powermac/udbg_scc: Fix refcount leak bug in udbg_scc_init()

2022-08-01 Thread Benjamin Herrenschmidt

On Sat, 2022-07-16 at 15:43 +0800, Liang He wrote:
> During the iteration of for_each_child_of_node(), we need to call
> of_node_put() for the old references stored in to 'ch_def' and 'ch_a'
> as their refcounters have been increased in last iteration.

None of these matter since those nodes are never *ever* released and
those machines don't use dynamic node allocation but ...

> Fixes: 51d3082fe6e5 ("[PATCH] powerpc: Unify udbg (#2)")
> Signed-off-by: Liang He 
> ---
>  arch/powerpc/platforms/powermac/udbg_scc.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powermac/udbg_scc.c
> b/arch/powerpc/platforms/powermac/udbg_scc.c
> index 734df5a32f99..1b7c39e841ee 100644
> --- a/arch/powerpc/platforms/powermac/udbg_scc.c
> +++ b/arch/powerpc/platforms/powermac/udbg_scc.c
> @@ -81,10 +81,14 @@ void __init udbg_scc_init(int force_scc)
>   if (path != NULL)
>   stdout = of_find_node_by_path(path);
>   for_each_child_of_node(escc, ch) {
> - if (ch == stdout)
> + if (ch == stdout) {
> + of_node_put(ch_def);
>   ch_def = of_node_get(ch);
> - if (of_node_name_eq(ch, "ch-a"))
> + }
> + if (of_node_name_eq(ch, "ch-a")) {
> + of_node_put(ch_a);
>   ch_a = of_node_get(ch);
> + }
>   }
>   if (ch_def == NULL && !force_scc)
>   goto bail;

Re: [PATCH] macintosh:fix oob read in do_adb_query function

2022-07-14 Thread Benjamin Herrenschmidt

On Wed, 2022-07-13 at 11:53 -0700, Kees Cook wrote:
> On Wed, Jul 13, 2022 at 11:37:34PM +0800, Ning Qiang wrote:
> > In do_adb_query function of drivers/macintosh/adb.c, req->data is
> > copy
> > form userland. the  parameter "req->data[2]" is Missing check, the
> > array size of adb_handler[] is 16, so "adb_handler[
> > req->data[2]].original_address" and "adb_handler[
> > req->data[2]].handler_id" will lead to oob read.
> > 
> > Signed-off-by: Ning Qiang 
> 
> Thanks for catching this!
> 
> Do you have a reproducer for this? I'd expect CONFIG_UBSAN_BOUNDS=y
> to notice this at runtime, at least.

For that you would need an ancient Mac with an ADB bus which might be
tricky ... I have some in the basement that could possibly be revived
if you really insist but I'd rather not waste the time...

Cheers,
Ben.

> 
> > ---
> >  drivers/macintosh/adb.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
> > index 439fab4eaa85..1bbb9ca08d40 100644
> > --- a/drivers/macintosh/adb.c
> > +++ b/drivers/macintosh/adb.c
> > @@ -647,7 +647,7 @@ do_adb_query(struct adb_request *req)
> >  
> > switch(req->data[1]) {
> > case ADB_QUERY_GETDEVINFO:
> > -   if (req->nbytes < 3)
> > +   if (req->nbytes < 3 || req->data[2] >= 16)
> 
> I'd prefer this was:
> 
> + if (req->nbytes < 3 || req->data[2] >=
> ARRAY_SIZE(adb_handler))
> 
> so it's tied to the actual variable (if its size ever changes).
> 
> With that:
> 
> Reviewed-by: Kees Cook 
> 
> -Kees
> 
> > break;
> > mutex_lock(_handler_mutex);
> > req->reply[0] = adb_handler[req-
> > >data[2]].original_address;
> > -- 
> > 2.25.1
> >

Re: [PATCH] macintosh:fix oob read in do_adb_query function

2022-07-14 Thread Benjamin Herrenschmidt

On Wed, 2022-07-13 at 23:37 +0800, Ning Qiang wrote:
> In do_adb_query function of drivers/macintosh/adb.c, req->data is
> copy
> form userland. the  parameter "req->data[2]" is Missing check, the
> array size of adb_handler[] is 16, so "adb_handler[
> req->data[2]].original_address" and "adb_handler[
> req->data[2]].handler_id" will lead to oob read.
> 
> Signed-off-by: Ning Qiang 

Acked-by: Benjamin Herrenschmidt 

> ---
>  drivers/macintosh/adb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/macintosh/adb.c b/drivers/macintosh/adb.c
> index 439fab4eaa85..1bbb9ca08d40 100644
> --- a/drivers/macintosh/adb.c
> +++ b/drivers/macintosh/adb.c
> @@ -647,7 +647,7 @@ do_adb_query(struct adb_request *req)
>  
>   switch(req->data[1]) {
>   case ADB_QUERY_GETDEVINFO:
> - if (req->nbytes < 3)
> + if (req->nbytes < 3 || req->data[2] >= 16)
>   break;
>   mutex_lock(_handler_mutex);
>   req->reply[0] = adb_handler[req-
> >data[2]].original_address;

Re: oob read in do_adb_query function

2022-07-12 Thread Benjamin Herrenschmidt

On Wed, 2022-07-13 at 09:54 +0800, sohu0106 wrote:
> 
> 
> In do_adb_query function of drivers/macintosh/adb.c, req->data is
> copy form userland. the  parameter "req->data[2]" is Missing check,
> the array size of adb_handler[] is 16, so "adb_handler[req-
> >data[2]].original_address" and "adb_handler[req-
> >data[2]].handler_id" will lead to oob read.

Thanks ! Can you send this with a proper Signed-off-by so we can apply
directly ?

Cheers,
Ben.
> 
> 
> diff --git a/adb.c b/adb.c_patch
> index 73b3961..8a5604b 100644
> --- a/adb.c
> +++ b/adb.c_patch
> @@ -647,7 +647,7 @@ do_adb_query(struct adb_request *req)
> 
> 
> switch(req->data[1]) {
> case ADB_QUERY_GETDEVINFO:
> -   if (req->nbytes < 3)
> +   if (req->nbytes < 3 || req->data[2] > 16)
> break;
> mutex_lock(_handler_mutex);
> req->reply[0] = adb_handler[req-

Re: [PATCH] powerpc: Update reviewers

2022-07-05 Thread Benjamin Herrenschmidt

On Wed, 2022-06-29 at 16:08 +1000, Michael Ellerman wrote:
> Christophe and Nick have been active in recent years on the mailing
> list
> and making contributions, add them as reviewers.
> 
> Paul and Ben are no longer actively reviewing powerpc patches, remove
> them from the reviewers, they're still on linuxppc-dev if needed.
> 
> Signed-off-by: Michael Ellerman 

Acked-by: Benjamin Herrenschmidt 

> ---
>  MAINTAINERS | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1fc9ead83d2a..af4cfeec9d0f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11345,8 +11345,8 @@ F:drivers/macintosh/
>  
>  LINUX FOR POWERPC (32-BIT AND 64-BIT)
>  M:   Michael Ellerman 
> -R:   Benjamin Herrenschmidt 
> -R:   Paul Mackerras 
> +R:   Nicholas Piggin 
> +R:   Christophe Leroy 
>  L:   linuxppc-dev@lists.ozlabs.org
>  S:   Supported
>  W:   https://github.com/linuxppc/wiki/wiki

Re: [PATCH net-next] eth: de4x5: remove support for Generic DECchip & DIGITAL EtherWORKS PCI/EISA

2022-05-26 Thread Benjamin Herrenschmidt

On Thu, 2022-05-26 at 09:43 +0200, Geert Uytterhoeven wrote:
>   Hi Ben,
> 
> On Sat, 21 May 2022, Benjamin Herrenschmidt wrote:
> > On Wed, 2022-05-18 at 20:13 -0700, Jakub Kicinski wrote:
> > > Looks like almost all changes to this driver had been tree-wide
> > > refactoring since git era begun. There is one commit from Al
> > > 15 years ago which could potentially be fixing a real bug.
> > > 
> > > The driver is using virt_to_bus() and is a real magnet for pointless
> > > cleanups. It seems unlikely to have real users. Let's try to shed
> > > this maintenance burden.
> > > 
> > > Signed-off-by: Jakub Kicinski 
> > 
> > Removing this driver will kill support for some rather old PowerMac
> > models (some PowerBooks I think, paulus would know). No objection on my
> > part, though. I doubt people still use these things with new kernels
> > but ... who knows ? :-)
> 
> Aren't these PCI, and thus working fine with the PCI-only DE2104X
> (dc2104x) or TULIP (dc2114x) drivers?
> 
> IIRC, I've initially used the de4x5 driver on Alpha (UDB/Multia) or PPC
> (CHRP), but switched to the TULIP driver later (that was before the
> dc2104x/dc2114x driver split, hence a lng time ago).

I'm pretty sure there were some old Macs who worked with de4x5 and not
tulip but I wouldn't rememeber the details and I'm not sure any of this
hardware still exist in the field nor matters.

Cheers,
Ben.

Re: [PATCH 2/2] drm/tiny: Add ofdrm for Open Firmware framebuffers

2022-05-26 Thread Benjamin Herrenschmidt

On Wed, 2022-05-25 at 18:45 +0200, Thomas Zimmermann wrote:
> I don't mind adding DRM support for BootX displays, but getting the 
> necessary test HW with a suitable Linux seems to be laborious. Would
> a  G4 Powerbook work?

Probably not unfortunately... it's going to be tricky. I might sitll
have some old BootX-based machines somewhere in storage I could try to
dig out, but it might not be worth bothering too much...

Cheers,
Ben.

Re: [PATCH 2/2] drm/tiny: Add ofdrm for Open Firmware framebuffers

2022-05-26 Thread Benjamin Herrenschmidt

On Wed, 2022-05-25 at 18:45 +0200, Thomas Zimmermann wrote:
> 
> > The palette handling is useful when using a real Open Firmware
> > implementation which tends to boot in 8-bit mode, so without palette
> > things will look ... bad.
> > 
> > It's not necessary when using 16/32 bpp framebuffers which is typically
> > ... what BootX provides :-)
> 
> Maybe the odd color formats can be tested via qemu.
> 
> I don't mind adding DRM support for BootX displays, but getting the 
> necessary test HW with a suitable Linux seems to be laborious. Would a 
> G4 Powerbook work?

My point was that it's the non-BootX case that cares about the palette
hacks :-)

Cheers,
Ben.

Re: [PATCH 2/2] drm/tiny: Add ofdrm for Open Firmware framebuffers

2022-05-24 Thread Benjamin Herrenschmidt

On Sun, 2022-05-22 at 21:35 +0200, Thomas Zimmermann wrote:
> > Interesting. Did you find some formats that were not supported ?
> 
> We still don't support XRGB1555. If the native buffer uses this format, 
> we'd have no conversion helper. In this case, we rely on userspace/fbcon 
> to use the native format exclusively.  (BTW, I asked one of my coworkers 
> to implement XRGB1555 to make her familiar with DRM. That's why I 
> haven't sent a patch yet.)
> 

Various old macs do 1555 ...

Cheers,
Ben.

Re: [PATCH net-next] eth: de4x5: remove support for Generic DECchip & DIGITAL EtherWORKS PCI/EISA

2022-05-20 Thread Benjamin Herrenschmidt

On Wed, 2022-05-18 at 20:13 -0700, Jakub Kicinski wrote:
> Looks like almost all changes to this driver had been tree-wide
> refactoring since git era begun. There is one commit from Al
> 15 years ago which could potentially be fixing a real bug.
> 
> The driver is using virt_to_bus() and is a real magnet for pointless
> cleanups. It seems unlikely to have real users. Let's try to shed
> this maintenance burden.
> 
> Signed-off-by: Jakub Kicinski 

Removing this driver will kill support for some rather old PowerMac
models (some PowerBooks I think, paulus would know). No objection on my
part, though. I doubt people still use these things with new kernels
but ... who knows ? :-)

Cheers,
Ben.

Re: [PATCH 2/2] drm/tiny: Add ofdrm for Open Firmware framebuffers

2022-05-20 Thread Benjamin Herrenschmidt

On Thu, 2022-05-19 at 09:27 +0200, Thomas Zimmermann wrote:

> to build without PCI to see what happens.

If you bring any of the "heuristic" and palette support code in, you
need PCI. I don't see any reason to take it out.

> Those old Macs use BootX, right? BootX is not supported ATM, as I don't 
> have the HW to test. Is there an emulator for it?

It isn't ? When did it break ? :-)

> If anyone what's to make patches for BootX, I'd be happy to add them. 
> The offb driver also supports a number of special cases for palette 
> handling. That might be necessary for ofdrm as well.

The palette handling is useful when using a real Open Firmware
implementation which tends to boot in 8-bit mode, so without palette
things will look ... bad.

It's not necessary when using 16/32 bpp framebuffers which is typically
... what BootX provides :-)

Cheers,
Ben.

> Best regards
> Thomas
> 
> > Gr{oetje,eeting}s,
> > 
> >  Geert
> > 
> > --
> > Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- 
> > ge...@linux-m68k.org
> > 
> > In personal conversations with technical people, I call myself a hacker. But
> > when I'm talking to journalists I just say "programmer" or something like 
> > that.
> >  -- Linus Torvalds
> 
> -- 
> Thomas Zimmermann
> Graphics Driver Developer
> SUSE Software Solutions Germany GmbH
> Maxfeldstr. 5, 90409 Nürnberg, Germany
> (HRB 36809, AG Nürnberg)
> Geschäftsführer: Ivo Totev

Re: [PATCH 2/2] drm/tiny: Add ofdrm for Open Firmware framebuffers

2022-05-20 Thread Benjamin Herrenschmidt

On Thu, 2022-05-19 at 09:11 +0200, Geert Uytterhoeven wrote:
> Hi Michal,
> 
> 
> 
> On Wed, May 18, 2022 at 8:54 PM Michal Suchánek 
> wrote:
> 
> > On Wed, May 18, 2022 at 08:30:06PM +0200, Thomas Zimmermann wrote:
> > > --- a/drivers/gpu/drm/tiny/Kconfig
> > > +++ b/drivers/gpu/drm/tiny/Kconfig
> > > @@ -51,6 +51,18 @@ config DRM_GM12U320
> > > This is a KMS driver for projectors which use the
> > > GM12U320 chipset
> > > for video transfer over USB2/3, such as the Acer C120
> > > mini projector.
> > > +config DRM_OFDRM
> > > + tristate "Open Firmware display driver"
> > > + depends on DRM && MMU && PPC
> > Does this build with !PCI?
> > The driver uses some PCI functions, so it might possibly break with
> > randconfig. I don't think there are practical !PCI PPC
> > configurations.
> 
> 
> IIRC, the first PowerMacs didn't have PCI.

They also don't have OF and we never supported them upstream :-)

Cheers,
Ben.

Re: [PATCH] powerpc: warn on emulation of dcbz instruction

2021-09-17 Thread Benjamin Herrenschmidt

On Thu, 2021-09-16 at 14:36 +, David Laight wrote:
> > Does userspace accesses non-cached memory directly ?
> 
> 
> It probably can if a driver mmaps PCI space directly into user space.
> 
> That certainly works on x86-64.

The posterchild for that is Xorg

Cheers,
Ben.

Re: [PATCH] powerpc: warn on emulation of dcbz instruction

2021-09-16 Thread Benjamin Herrenschmidt

On Thu, 2021-09-16 at 17:15 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2021-09-15 at 16:31 +0200, Christophe Leroy wrote:
> > dcbz instruction shouldn't be used on non-cached memory. Using
> > it on non-cached memory can result in alignment exception and
> > implies a heavy handling.
> > 
> > Instead of silentely emulating the instruction and resulting in
> > high
> > performance degradation, warn whenever an alignment exception is
> > taken due to dcbz, so that the user is made aware that dcbz
> > instruction has been used unexpectedly.
> > 
> > Reported-by: Stan Johnson 
> > Cc: Finn Thain 
> > Signed-off-by: Christophe Leroy 
> > ---
> >  arch/powerpc/kernel/align.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/arch/powerpc/kernel/align.c
> > b/arch/powerpc/kernel/align.c
> > index bbb4181621dd..adc3a4a9c6e4 100644
> > --- a/arch/powerpc/kernel/align.c
> > +++ b/arch/powerpc/kernel/align.c
> > @@ -349,6 +349,7 @@ int fix_alignment(struct pt_regs *regs)
> > if (op.type != CACHEOP + DCBZ)
> > return -EINVAL;
> > PPC_WARN_ALIGNMENT(dcbz, regs);
> > +   WARN_ON_ONCE(1);
> 
> This is heavy handed ... It will be treated as an oops by various
> things uselessly spit out a kernel backtrace. Isn't
> PPC_WARN_ALIGNMENT
> enough ?

Ah I saw your other one about fbdev...  Ok what about you do that in a
if (!user_mode(regs)) ?

Indeed the kernel should not do that.

Cheers,
Ben.

Re: [PATCH] powerpc: warn on emulation of dcbz instruction

2021-09-16 Thread Benjamin Herrenschmidt

On Wed, 2021-09-15 at 16:31 +0200, Christophe Leroy wrote:
> dcbz instruction shouldn't be used on non-cached memory. Using
> it on non-cached memory can result in alignment exception and
> implies a heavy handling.
> 
> Instead of silentely emulating the instruction and resulting in high
> performance degradation, warn whenever an alignment exception is
> taken due to dcbz, so that the user is made aware that dcbz
> instruction has been used unexpectedly.
> 
> Reported-by: Stan Johnson 
> Cc: Finn Thain 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/align.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kernel/align.c
> b/arch/powerpc/kernel/align.c
> index bbb4181621dd..adc3a4a9c6e4 100644
> --- a/arch/powerpc/kernel/align.c
> +++ b/arch/powerpc/kernel/align.c
> @@ -349,6 +349,7 @@ int fix_alignment(struct pt_regs *regs)
>   if (op.type != CACHEOP + DCBZ)
>   return -EINVAL;
>   PPC_WARN_ALIGNMENT(dcbz, regs);
> + WARN_ON_ONCE(1);

This is heavy handed ... It will be treated as an oops by various
things uselessly spit out a kernel backtrace. Isn't PPC_WARN_ALIGNMENT
enough ?

Ben.

Re: Possible regression by ab037dd87a2f (powerpc/vdso: Switch VDSO to generic C implementation.)

2021-08-02 Thread Benjamin Herrenschmidt

On Mon, 2021-08-02 at 16:18 +1000, Michael Ellerman wrote:
> 
> But to be fair the ABI of the VDSO has always been a little fishy,
> 
> because the entry points pretend to be a transparent wrapper around
> 
> system calls, but then in a case like this are not.

This is somewhat debatable :-) If your perspective is from an
application, whether your wrapper is glibc, the vdso or *** knows what
else, you can't make assumptions about the state of the registers on a
signal hitting somewhere in your call chain other than what's
guanranteed by the ABI overall (ie, TLS etc...).

Nowhere was it written that a VDSO call behaved strictly like an sc
instruction :-)
> 
> > Go up to this point has only used the vdso function __kernel_clock_gettime; 
> > it
> > is the only entry point which would need to explicitly avoid R30 for
> > Go's sake.
> 
> 
> I thought about limiting the workaround to just that code, but it seemed
> silly and likely to come back to bite us. Once the compilers decides to
> spill a non-volatile there are plenty of other ones to choose from.

Yeah fine graining this is a waste of time, I agree. Just stick with
fixing r30 for the whole vdso, it won't actually hurt, just make sure
this is somewhat documented as to why we do it (I don't remember what
you patch does off hand, I assume at least your git commit has all the
data, a comment near the offending line in the Makefile might be good
too).

Cheers,
Ben.

Re: Possible regression by ab037dd87a2f (powerpc/vdso: Switch VDSO to generic C implementation.)

2021-07-27 Thread Benjamin Herrenschmidt

On Tue, 2021-07-27 at 10:45 +0200, Paul Menzel wrote:
> Dear Christophe,
> 
> 
> On ppc64le Go 1.16.2 from Ubuntu 21.04 terminates with a segmentation 
> fault [1], and it might be related to *[release-branch.go1.16] runtime: 
> fix crash during VDSO calls on PowerPC* [2], conjecturing that commit 
> ab037dd87a2f (powerpc/vdso: Switch VDSO to generic C implementation.) 
> added in Linux 5.11 causes this.
> 
> If this is indeed the case, this would be a regression in userspace. Is 
> there a generic fix or should the change be reverted?

>From the look at the links you posted, this appears to be completely
broken assumptions by Go that some registers don't change while calling
what essentially are external library functions *while inside those
functions* (ie in this case from a signal handler).

I suppose it would be possible to build the VDSO with gcc arguments to
make it not use r30, but that's just gross...

Cheers,
Ben.

Re: [PATCH net-next v4 2/2] of: net: fix of_get_mac_addr_nvmem() for non-platform devices

2021-04-26 Thread Benjamin Herrenschmidt

On Mon, 2021-04-26 at 12:54 +0200, Michael Walle wrote:
> Before I'll try to come up with a patch for this, I'd like to get
> your opinion on it.
> 
> (1) replacing of_get_mac_address(node) with eth_get_mac_address(dev)
>  might sometimes lead to confusing comments like in
>  drivers/net/ethernet/allwinner/sun4i-emac.c:
> 
>  /* Read MAC-address from DT */
>  ret = of_get_mac_address(np, ndev->dev_addr);

You could leave it or turn it into "from platform", doesn't matter...

> (2) What do you think of eth_get_mac_address(ndev). That is, the

Not sure what you mean, eth_platform_get_mac_address() takes the
address as an argument. I think what you want is a consolidated
nvmem_get_mac_address + eth_platform_get_mac_address that takes a
device, which would have no requirement of the bus_type at all.

Cheers,
Ben.

Re: [PATCH net-next v4 2/2] of: net: fix of_get_mac_addr_nvmem() for non-platform devices

2021-04-15 Thread Benjamin Herrenschmidt

On Mon, 2021-04-12 at 19:47 +0200, Michael Walle wrote:
> 
>  /**
>   * of_get_phy_mode - Get phy mode for given device_node
> @@ -59,15 +60,39 @@ static int of_get_mac_addr(struct device_node *np, const 
> char *name, u8 *addr)
>  static int of_get_mac_addr_nvmem(struct device_node *np, u8 *addr)
>  {
> struct platform_device *pdev = of_find_device_by_node(np);
> +   struct nvmem_cell *cell;
> +   const void *mac;
> +   size_t len;
> int ret;
>  
> -   if (!pdev)
> -   return -ENODEV;
> +   /* Try lookup by device first, there might be a nvmem_cell_lookup
> +* associated with a given device.
> +*/
> +   if (pdev) {
> +   ret = nvmem_get_mac_address(>dev, addr);
> +   put_device(>dev);
> +   return ret;
> +   }
> +

This smells like the wrong band aid :)

Any struct device can contain an OF node pointer these days.

This seems all backwards. I think we are dealing with bad evolution.

We need to do a lookup for the device because we get passed an of_node.
We should just get passed a device here... or rather stop calling
of_get_mac_addr() from all those drivers and instead call
eth_platform_get_mac_address() which in turns calls of_get_mac_addr().

Then the nvmem stuff gets put in eth_platform_get_mac_address().

of_get_mac_addr() becomes a low-level thingy that most drivers don't
care about.

Cheers,
Ben.

Re: [PATCH] powerpc/64s: power4 nap fixup in C

2021-03-29 Thread Benjamin Herrenschmidt

On Fri, 2021-03-12 at 11:20 +1000, Nicholas Piggin wrote:
> 
> +static inline void nap_adjust_return(struct pt_regs *regs)
> 
> +{
> 
> +#ifdef CONFIG_PPC_970_NAP
> 
> +   if (unlikely(test_thread_local_flags(_TLF_NAPPING))) {
> +   /* Can avoid a test-and-clear because NMIs do not call this */
> +   clear_thread_local_flags(_TLF_NAPPING);
> +   regs->nip = (unsigned long)power4_idle_nap_return;
> +   }

Is this a pointer to a function descriptor or the actual code ?

Cheers,
Ben.

> +#endif
> 
> +}
> 
> +
> 
>  struct interrupt_state {
> 
>  #ifdef CONFIG_PPC_BOOK3E_64
> 
> enum ctx_state ctx_state;
> 
> @@ -111,6 +122,9 @@ static inline void interrupt_async_exit_prepare(struct 
> pt_regs *regs, struct int
> 
>  {
> 
> irq_exit();
> 
> interrupt_exit_prepare(regs, state);
> 
> +
> 
> +   /* Adjust at exit so the main handler sees the true NIA */
> 
> +   nap_adjust_return(regs);
> 
>  }
> 
>  
> 
>  struct interrupt_nmi_state {
> 
> @@ -164,6 +178,11 @@ static inline void interrupt_nmi_exit_prepare(struct 
> pt_regs *regs, struct inter
> 
> radix_enabled() || (mfmsr() & MSR_DR))
> 
> nmi_exit();
> 
>  
> 
> +   /*
> 
> +* nmi does not call nap_adjust_return because nmi should not create
> 
> +* new work to do (must use irq_work for that).
> 
> +*/
> 
> +
> 
>  #ifdef CONFIG_PPC64
> 
> if (TRAP(regs) != 0x900 && TRAP(regs) != 0xf00 && TRAP(regs) != 0x260)
> 
> this_cpu_set_ftrace_enabled(state->ftrace_enabled);
> 
> diff --git a/arch/powerpc/include/asm/processor.h 
> b/arch/powerpc/include/asm/processor.h
> 
> index 8acc3590c971..eedc3c775141 100644
> 
> --- a/arch/powerpc/include/asm/processor.h
> 
> +++ b/arch/powerpc/include/asm/processor.h
> 
> @@ -393,6 +393,7 @@ extern unsigned long isa300_idle_stop_mayloss(unsigned 
> long psscr_val);
> 
>  extern unsigned long isa206_idle_insn_mayloss(unsigned long type);
> 
>  #ifdef CONFIG_PPC_970_NAP
> 
>  extern void power4_idle_nap(void);
> 
> +void power4_idle_nap_return(void);
> 
>  #endif
> 
>  
> 
>  extern unsigned long cpuidle_disable;
> 
> diff --git a/arch/powerpc/include/asm/thread_info.h 
> b/arch/powerpc/include/asm/thread_info.h
> 
> index 386d576673a1..bf137151100b 100644
> 
> --- a/arch/powerpc/include/asm/thread_info.h
> 
> +++ b/arch/powerpc/include/asm/thread_info.h
> 
> @@ -152,6 +152,12 @@ void arch_setup_new_exec(void);
> 
>  
> 
>  #ifndef __ASSEMBLY__
> 
>  
> 
> +static inline void clear_thread_local_flags(unsigned int flags)
> 
> +{
> 
> +   struct thread_info *ti = current_thread_info();
> 
> +   ti->local_flags &= ~flags;
> 
> +}
> 
> +
> 
>  static inline bool test_thread_local_flags(unsigned int flags)
> 
>  {
> 
> struct thread_info *ti = current_thread_info();
> 
> diff --git a/arch/powerpc/kernel/exceptions-64s.S 
> b/arch/powerpc/kernel/exceptions-64s.S
> 
> index 60d3051a8bc8..ea7a443488d2 100644
> 
> --- a/arch/powerpc/kernel/exceptions-64s.S
> 
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> 
> @@ -692,25 +692,6 @@ END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
> 
> ld  r1,GPR1(r1)
> 
>  .endm
> 
>  
> 
> -/*
> 
> - * When the idle code in power4_idle puts the CPU into NAP mode,
> 
> - * it has to do so in a loop, and relies on the external interrupt
> 
> - * and decrementer interrupt entry code to get it out of the loop.
> 
> - * It sets the _TLF_NAPPING bit in current_thread_info()->local_flags
> 
> - * to signal that it is in the loop and needs help to get out.
> 
> - */
> 
> -#ifdef CONFIG_PPC_970_NAP
> 
> -#define FINISH_NAP \
> 
> -BEGIN_FTR_SECTION  \
> 
> -   ld  r11, PACA_THREAD_INFO(r13); \
> 
> -   ld  r9,TI_LOCAL_FLAGS(r11); \
> 
> -   andi.   r10,r9,_TLF_NAPPING;\
> 
> -   bnelpower4_fixup_nap;   \
> 
> -END_FTR_SECTION_IFSET(CPU_FTR_CAN_NAP)
> 
> -#else
> 
> -#define FINISH_NAP
> 
> -#endif
> 
> -
> 
>  /*
> 
>   * There are a few constraints to be concerned with.
> 
>   * - Real mode exceptions code/data must be located at their physical 
> location.
> 
> @@ -1248,7 +1229,6 @@ EXC_COMMON_BEGIN(machine_check_common)
> 
>  */
> 
> GEN_COMMON machine_check
> 
>  
> 
> -   FINISH_NAP
> 
> /* Enable MSR_RI when finished with PACA_EXMC */
> 
> li  r10,MSR_RI
> 
> mtmsrd  r10,1
> 
> @@ -1571,7 +1551,6 @@ EXC_VIRT_BEGIN(hardware_interrupt, 0x4500, 0x100)
> 
>  EXC_VIRT_END(hardware_interrupt, 0x4500, 0x100)
> 
>  EXC_COMMON_BEGIN(hardware_interrupt_common)
> 
> GEN_COMMON hardware_interrupt
> 
> -   FINISH_NAP
> 
> addir3,r1,STACK_FRAME_OVERHEAD
> 
> bl  do_IRQ
> 
> b   interrupt_return
> 
> @@ -1801,7 +1780,6 @@ EXC_VIRT_BEGIN(decrementer, 0x4900, 0x80)
> 
>  EXC_VIRT_END(decrementer, 0x4900, 0x80)
> 
>

Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-23 Thread Benjamin Herrenschmidt

On Thu, 2020-07-23 at 01:21 -0400, Alex Ghiti wrote:
> > works fine with huge pages, what is your problem there ? You rely on
> > punching small-page size holes in there ?
> > 
> 
> ARCH_HAS_STRICT_KERNEL_RWX prevents the use of a hugepage for the kernel 
> mapping in the direct mapping as it sets different permissions to 
> different part of the kernel (data, text..etc).

Ah ok, that can be solved in a couple of ways...

One is to use the linker script to ensure those sections are linked
HUGE_PAGE_SIZE appart and moved appropriately by early boot code. One
is to selectively degrade just those huge pages.

I'm not familiar with the RiscV MMU (I should probably go have a look)
but if it's a classic radix tree with huge pages at PUD/PMD level, then
you could just degrade the one(s) that cross those boundaries.

Cheers,
Ben.

Re: [RFC PATCH] powerpc/pseries/svm: capture instruction faulting on MMIO access, in sprg0 register

2020-07-21 Thread Benjamin Herrenschmidt

On Wed, 2020-07-22 at 12:06 +1000, Michael Ellerman wrote:
> Ram Pai  writes:
> > An instruction accessing a mmio address, generates a HDSI fault.  This 
> > fault is
> > appropriately handled by the Hypervisor.  However in the case of secureVMs, 
> > the
> > fault is delivered to the ultravisor.
> > 
> > Unfortunately the Ultravisor has no correct-way to fetch the faulting
> > instruction. The PEF architecture does not allow Ultravisor to enable MMU
> > translation. Walking the two level page table to read the instruction can 
> > race
> > with other vcpus modifying the SVM's process scoped page table.
> 
> You're trying to read the guest's kernel text IIUC, that mapping should
> be stable. Possibly permissions on it could change over time, but the
> virtual -> real mapping should not.

This will of course no longer work if userspace tries to access MMIO
but as long as you are ok limiting MMIO usage to the kernel, that's one
way.

iirc MMIO emulation in KVM on powerpc already has that race because of
HW TLB inval broadcast and it hasn't hurt anybody ... so far.

> > This problem can be correctly solved with some help from the kernel.
> > 
> > Capture the faulting instruction in SPRG0 register, before executing the
> > faulting instruction. This enables the ultravisor to easily procure the
> > faulting instruction and emulate it.
> 
> This is not something I'm going to merge. Sorry.

Another approach is to have the guest explicitly switch to using UV
calls for MMIO.

Cheers,
Ben.

Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-21 Thread Benjamin Herrenschmidt

On Tue, 2020-07-21 at 16:48 -0700, Palmer Dabbelt wrote:
> > Why ? Branch distance limits ? You can't use trampolines ?
> 
> Nothing fundamental, it's just that we don't have a large code model in the C
> compiler.  As a result all the global symbols are resolved as 32-bit
> PC-relative accesses.  We could fix this with a fast large code model, but 
> then
> the kernel would need to relax global symbol references in modules and we 
> don't
> even do that for the simple code models we have now.  FWIW, some of the
> proposed large code models are essentially just split-PLT/GOT and therefor
> don't require relaxation, but at that point we're essentially PIC until we
> have more that 2GiB of kernel text -- and even then, we keep all the
> performance issues.

My memory might be out of date but I *think* we do it on powerpc
without going to a large code model, but just having the in-kernel
linker insert trampolines.

Cheers,
Ben.

Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-21 Thread Benjamin Herrenschmidt

On Tue, 2020-07-21 at 12:05 -0700, Palmer Dabbelt wrote:
> 
> * We waste vmalloc space on 32-bit systems, where there isn't a lot of it.
> * On 64-bit systems the VA space around the kernel is precious because it's 
> the
>   only place we can place text (modules, BPF, whatever). 

Why ? Branch distance limits ? You can't use trampolines ?

>  If we start putting
>   the kernel in the vmalloc space then we either have to pre-allocate a bunch
>   of space around it (essentially making it a fixed mapping anyway) or it
>   becomes likely that we won't be able to find space for modules as they're
>   loaded into running systems.

I dislike the kernel being in the vmalloc space (see my other email)
but I don't understand the specific issue with modules.

> * Relying on a relocatable kernel for sv48 support introduces a fairly large
>   performance hit.

Out of curiosity why would relocatable kernels introduce a significant
hit ? Where about do you see the overhead coming from ?

> Roughly, my proposal would be to:
> 
> * Leave the 32-bit memory map alone.  On 32-bit systems we can load modules
>   anywhere and we only have one VA width, so we're not really solving any
>   problems with these changes.
> * Staticly allocate a 2GiB portion of the VA space for all our text, as its 
> own
>   region.  We'd link/relocate the kernel here instead of around PAGE_OFFSET,
>   which would decouple the kernel from the physical memory layout of the 
> system.
>   This would have the side effect of sorting out a bunch of bootloader 
> headaches
>   that we currently have.
> * Sort out how to maintain a linear map as the canonical hole moves around
>   between the VA widths without adding a bunch of overhead to the virt2phys 
> and
>   friends.  This is probably going to be the trickiest part, but I think if we
>   just change the page table code to essentially lie about VAs when an sv39
>   system runs an sv48+sv39 kernel we could make it work -- there'd be some
>   logical complexity involved, but it would remain fast.
> 
> This doesn't solve the problem of virtually relocatable kernels, but it does
> let us decouple that from the sv48 stuff.  It also lets us stop relying on a
> fixed physical address the kernel is loaded into, which is another thing I
> don't like.
> 
> I know this may be a more complicated approach, but there aren't any sv48
> systems around right now so I just don't see the rush to support them,
> particularly when there's a cost to what already exists (for those who haven't
> been watching, so far all the sv48 patch sets have imposed a significant
> performance penalty on all systems).

Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-21 Thread Benjamin Herrenschmidt

On Tue, 2020-07-21 at 14:36 -0400, Alex Ghiti wrote:
> > > I guess I don't understand why this is necessary at all.  
> > > Specifically: why
> > > can't we just relocate the kernel within the linear map?  That would 
> > > let the
> > > bootloader put the kernel wherever it wants, modulo the physical 
> > > memory size we
> > > support.  We'd need to handle the regions that are coupled to the 
> > > kernel's
> > > execution address, but we could just put them in an explicit memory 
> > > region
> > > which is what we should probably be doing anyway.
> > 
> > Virtual relocation in the linear mapping requires to move the kernel 
> > physically too. Zong implemented this physical move in its KASLR RFC 
> > patchset, which is cumbersome since finding an available physical spot 
> > is harder than just selecting a virtual range in the vmalloc range.
> > 
> > In addition, having the kernel mapping in the linear mapping prevents 
> > the use of hugepage for the linear mapping resulting in performance loss 
> > (at least for the GB that encompasses the kernel).
> > 
> > Why do you find this "ugly" ? The vmalloc region is just a bunch of 
> > available virtual addresses to whatever purpose we want, and as noted by 
> > Zong, arm64 uses the same scheme.

I don't get it :-)

At least on powerpc we move the kernel in the linear mapping and it
works fine with huge pages, what is your problem there ? You rely on
punching small-page size holes in there ?

At least in the old days, there were a number of assumptions that
the kernel text/data/bss resides in the linear mapping.

If you change that you need to ensure that it's still physically
contiguous and you'll have to tweak __va and __pa, which might induce
extra overhead.

Cheers,
Ben.

Re: [PATCH] powerpc/64: Fix an out of date comment about MMIO ordering

2020-07-16 Thread Benjamin Herrenschmidt

On Thu, 2020-07-16 at 12:38 -0700, Palmer Dabbelt wrote:
> From: Palmer Dabbelt 
> 
> This primitive has been renamed, but because it was spelled incorrectly in the
> first place it must have escaped the fixup patch.  As far as I can tell this
> logic is still correct: smp_mb__after_spinlock() uses the default smp_mb()
> implementation, which is "sync" rather than "hwsync" but those are the same
> (though I'm not that familiar with PowerPC).

Typo ? That must be me ... :)

Looks fine. Yes, sync and hwsync are the same (by opposition to lwsync
which is lighter weight and doesn't order cache inhibited).

Cheers,
Ben.

> Signed-off-by: Palmer Dabbelt 
> ---
>  arch/powerpc/kernel/entry_64.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index b3c9f15089b6..7b38b4daca93 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -357,7 +357,7 @@ _GLOBAL(_switch)
>* kernel/sched/core.c).
>*
>* Uncacheable stores in the case of involuntary preemption must
> -  * be taken care of. The smp_mb__before_spin_lock() in __schedule()
> +  * be taken care of. The smp_mb__after_spinlock() in __schedule()
>* is implemented as hwsync on powerpc, which orders MMIO too. So
>* long as there is an hwsync in the context switch path, it will
>* be executed on the source CPU after the task has performed

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-15 Thread Benjamin Herrenschmidt

On Wed, 2020-07-15 at 17:12 -0500, Bjorn Helgaas wrote:
> > I've 'played' with PCIe error handling - without much success.
> > What might be useful is for a driver that has just read ~0u to
> > be able to ask 'has there been an error signalled for this device?'.
> 
> In many cases a driver will know that ~0 is not a valid value for the
> register it's reading.  But if ~0 *could* be valid, an interface like
> you suggest could be useful.  I don't think we have anything like that
> today, but maybe we could.  It would certainly be nice if the PCI core
> noticed, logged, and cleared errors.  We have some of that for AER,
> but that's an optional feature, and support for the error bits in the
> garden-variety PCI_STATUS register is pretty haphazard.  As you note
> below, this sort of SERR/PERR reporting is frequently hard-wired in
> ways that takes it out of our purview.

We do have pci_channel_state (via pci_channel_offline()) which covers
the cases where the underlying error handling (such as EEH or unplug)
results in the device being offlined though this tend to be
asynchronous so it might take a few ~0's before you get it.

It's typically used to break potentially infinite loops in some
drivers.

There is no interface to check whether *an* error happened though for
the most cases it will be captured in the status register, which is
harvested (and cleared ?) by some EDAC drivers iirc... 

All this lacks coordination, I agree.

Cheers,
Ben.

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-15 Thread Benjamin Herrenschmidt

On Wed, 2020-07-15 at 08:47 +0200, Arnd Bergmann wrote:
> drivers/misc/cardreader/rts5261.c:  retval =
> rtsx_pci_write_config_dword(pcr, PCR_SETTING_REG2, lval);
> 
> That last one is interesting because I think this is a case in which
> we
> actually want to check for errors, as the driver seems to use it
> to ensure that accessing extended config space at offset 0x814
> works before relying on the value. Unfortunately the implementation
> seems buggy as it a) keeps using the possibly uninitialized value
> after
> printing a warning and b) returns the PCIBIOS_* value in place of a
> negative errno and then ignores it in the caller.

In cases like this, usually checking against ~0 is sufficient

Cheers,
Ben.

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Benjamin Herrenschmidt

On Tue, 2020-07-14 at 18:46 -0500, Bjorn Helgaas wrote:
> Yes.  I have no problem with that.  There are a few cases where it's
> important to check for errors, e.g., we read a status register and do
> something based on a bit being set.  A failure will return all bits
> set, and we may do the wrong thing.  But most of the errors we care
> about will be on MMIO reads, not config reads, so we can probably
> ignore most config read errors.

And in both cases, we don't have the plumbing to provide accurate
and reliable error returns for all platforms anyways (esp. not for
MMIO).

I think it makes sense to stick to the good old "if all 1's, then go
out of line" including for config space.

 ../..

> Yep, except for things like device removal or other PCI errors.

A whole bunch of these are reported asynchronously, esp for writes (and
yes, including config writes, they are supposed to be non-posted but
more often than not, the path  from the CPU to the PCI bridge remains
posted for writes including config ones).

> So maybe a good place to start is by removing some of the useless
> error checking for pci_read_config_*() and pci_write_config_*().
> That's a decent-sized but not impractical project that could be done
> per subsystem or something:
> 
>   git grep -E "(if|return|=).*\ 
> finds about 400 matches.
> 
> Some of those callers probably really *do* want to check for errors,
> and I guess we'd have to identify them and do them separately as you
> mentioned.

I'd be curious about these considering how unreliable our error return
is accross the board.

Cheers,
Ben.

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Benjamin Herrenschmidt

On Tue, 2020-07-14 at 23:02 +0200, Kjetil Oftedal wrote:
> > 
> > > For b), it might be nice to also change other aspects of the
> > > interface, e.g. passing a pci_host_bridge pointer plus bus number
> > > instead of a pci_bus pointer, or having the callback in the
> > > pci_host_bridge structure.
> > 
> > I like this idea a lot, too.  I think the fact that
> > pci_bus_read_config_word() requires a pci_bus * complicates things in
> > a few places.
> > 
> > I think it's completely separate, as you say, and we should defer it
> > for now because even part a) is a lot of work.  I added it to my list
> > of possible future projects.
> > 
> 
> What about strange PCI devices such as Non-Transparent bridges?
> They will require their own PCI Config space accessors that is not
> connected to a host bridge if one wants to do some sort of
> punch-through enumeration.
> I guess the kernel doesn't care much about them?

Well, today they would require a pci_bus anyway.. . so if you want to do
that sort of funny trick you may as well create a "virtual" host bridge.

Cheers,
Ben.

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Benjamin Herrenschmidt

On Tue, 2020-07-14 at 13:45 -0500, Bjorn Helgaas wrote:
> 
> > fail for valid arguments on a valid pci_device* ?
> 
> I really like this idea.
> 
> pci_write_config_*() has one return value, and only 100ish of 2500
> callers check for errors.  It's sometimes possible for config
> accessors to detect PCI errors and return failure, e.g., device was
> removed or didn't respond, but most of them don't, and detecting
> these
> errors is not really that valuable.
> 
> pci_read_config_*() is much more interesting because it returns two
> things, the function return value and the value read from the PCI
> device, and it's complicated to check both. 

  .../...

I agree. It's a mess at the moment.

We have separate mechanism to convey PCI errors (among other things the
channel state) which should apply to config space when detection is
possible.

I think returning all 1's is the right thing to do here and avoids odd
duplicate error detection logic which I bet you is never properly
tested.

> > For b), it might be nice to also change other aspects of the
> > interface, e.g. passing a pci_host_bridge pointer plus bus number
> > instead of a pci_bus pointer, or having the callback in the
> > pci_host_bridge structure.
> 
> I like this idea a lot, too.  I think the fact that
> pci_bus_read_config_word() requires a pci_bus * complicates things in
> a few places.
> 
> I think it's completely separate, as you say, and we should defer it
> for now because even part a) is a lot of work.  I added it to my list
> of possible future projects.

Agreed on both points.

Cheers,
Ben.

Re: [PATCH v2 2/2] powerpc/configs: replace deprecated riva/nvidia with nouveau

2020-05-18 Thread Benjamin Herrenschmidt

On Mon, 2020-05-18 at 12:19 +0100, Emil Velikov wrote:
> 
>  - attempted out-of-bound attempts to read the vbios

So on these things, the actual ROM doesn't contain what you want, but
the device-tree has a property "NVDA,BMP" that contains some kind of
mini-BIOS (around 2.4KB) which should contain the necessary tables the
driver is looking for.

I think nouveau has code to find these in nvkm/subdev/bios/shadowof.c,
so at least that should have been working, but maybe some
debugging/instrumentation would be useful there.

> Genuine concern or noise? Likely using the bios from open firmware,
> check any of the other options - see NvBios in [1]
>  - cannot figure out the timer input frequency
> No idea
>  - the TV1 EDID is empty
> Is there an actual TV connected to the device, check with another cable

Probaby not.

> Regardless of the patches, reporting [2] the above would be a nice move.
> 
> Thanks
> Emil
> [1] https://nouveau.freedesktop.org/wiki/KernelModuleParameters/
> [2] https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/-/issues

Re: [PATCH v2 2/2] powerpc/configs: replace deprecated riva/nvidia with nouveau

2020-05-18 Thread Benjamin Herrenschmidt

On Mon, 2020-05-18 at 12:00 +0100, Emil Velikov wrote:
> I believe you reported issues due to different page size for the CPU/GPU.
> Have you tried nouveau recently, there has been a handful of patches
> on the topic since your report.
> 
> Alternatively, it would make sense you rebase, cleanup and merge your patch.

That was a problem for the G5s. There were other issues for more
ancient machines with older nVidia GPUs. Additionally a lot of those
Apple machines don't have a BIOS ROM to get the various tables from.

At this stage unfortunately I don't have access to most of that HW to
test with anymore. I do have one G5 I might be able to dig out of my
basement this week to try out.

In any case, digging out that patch should be useful as powerpc64 is
still 64k pages :)

Cheers,
Ben.

Re: [PATCH v2 2/2] powerpc/configs: replace deprecated riva/nvidia with nouveau

2020-05-17 Thread Benjamin Herrenschmidt

On Sun, 2020-05-17 at 23:05 +0100, Emil Velikov wrote:
> As mentioned in earlier commit, the riva and nvidia fbdev drivers
> have
> seen no love over the years, are short on features and overall below
> par
> 
> Users are encouraged to switch to the nouveau drm driver instead.
> 
> v2: Split configs to separate patch, enable nouveau (Bartlomiej)

Back when I still had these things to play with (years ago) nouveau
didn't work properly on these ancient machines.

Cheers,
Ben.

> Cc: Antonino Daplas 
> Cc: Bartlomiej Zolnierkiewicz 
> Cc: linux-fb...@vger.kernel.org
> Cc: dri-de...@lists.freedesktop.org
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Emil Velikov 
> Acked-by: Daniel Vetter  (v1)
> ---
> Hi all unless, there are objections I would prefer to merge this via
> the drm tree.
> 
> Thanks
> Emil
> ---
>  arch/powerpc/configs/g5_defconfig | 10 --
>  arch/powerpc/configs/pasemi_defconfig |  9 +++--
>  arch/powerpc/configs/pmac32_defconfig |  9 +++--
>  arch/powerpc/configs/ppc6xx_defconfig | 10 +++---
>  4 files changed, 29 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/configs/g5_defconfig
> b/arch/powerpc/configs/g5_defconfig
> index a68c7f3af10e..213472f373b3 100644
> --- a/arch/powerpc/configs/g5_defconfig
> +++ b/arch/powerpc/configs/g5_defconfig
> @@ -124,12 +124,18 @@ CONFIG_RAW_DRIVER=y
>  CONFIG_I2C_CHARDEV=y
>  CONFIG_AGP=m
>  CONFIG_AGP_UNINORTH=m
> +CONFIG_DRM=y
> +CONFIG_DRM_NOUVEAU=m
> +# CONFIG_NOUVEAU_LEGACY_CTX_SUPPORT is not set
> +CONFIG_NOUVEAU_DEBUG=5
> +CONFIG_NOUVEAU_DEBUG_DEFAULT=3
> +# CONFIG_NOUVEAU_DEBUG_MMU is not set
> +CONFIG_DRM_NOUVEAU_BACKLIGHT=y
> +# CONFIG_DRM_NOUVEAU_SVM is not set
>  CONFIG_FB=y
>  CONFIG_FIRMWARE_EDID=y
>  CONFIG_FB_TILEBLITTING=y
>  CONFIG_FB_OF=y
> -CONFIG_FB_NVIDIA=y
> -CONFIG_FB_NVIDIA_I2C=y
>  CONFIG_FB_RADEON=y
>  # CONFIG_VGA_CONSOLE is not set
>  CONFIG_FRAMEBUFFER_CONSOLE=y
> diff --git a/arch/powerpc/configs/pasemi_defconfig
> b/arch/powerpc/configs/pasemi_defconfig
> index 08b7f4cef243..ccb3ab5e01da 100644
> --- a/arch/powerpc/configs/pasemi_defconfig
> +++ b/arch/powerpc/configs/pasemi_defconfig
> @@ -102,11 +102,16 @@ CONFIG_SENSORS_LM85=y
>  CONFIG_SENSORS_LM90=y
>  CONFIG_DRM=y
>  CONFIG_DRM_RADEON=y
> +CONFIG_DRM_NOUVEAU=m
> +# CONFIG_NOUVEAU_LEGACY_CTX_SUPPORT is not set
> +CONFIG_NOUVEAU_DEBUG=5
> +CONFIG_NOUVEAU_DEBUG_DEFAULT=3
> +# CONFIG_NOUVEAU_DEBUG_MMU is not set
> +CONFIG_DRM_NOUVEAU_BACKLIGHT=y
> +# CONFIG_DRM_NOUVEAU_SVM is not set
>  CONFIG_FIRMWARE_EDID=y
>  CONFIG_FB_TILEBLITTING=y
>  CONFIG_FB_VGA16=y
> -CONFIG_FB_NVIDIA=y
> -CONFIG_FB_NVIDIA_I2C=y
>  CONFIG_FB_RADEON=y
>  # CONFIG_LCD_CLASS_DEVICE is not set
>  CONFIG_VGACON_SOFT_SCROLLBACK=y
> diff --git a/arch/powerpc/configs/pmac32_defconfig
> b/arch/powerpc/configs/pmac32_defconfig
> index 05e325ca3fbd..f858627385c8 100644
> --- a/arch/powerpc/configs/pmac32_defconfig
> +++ b/arch/powerpc/configs/pmac32_defconfig
> @@ -199,6 +199,13 @@ CONFIG_DRM=m
>  CONFIG_DRM_RADEON=m
>  CONFIG_DRM_LEGACY=y
>  CONFIG_DRM_R128=m
> +CONFIG_DRM_NOUVEAU=m
> +# CONFIG_NOUVEAU_LEGACY_CTX_SUPPORT is not set
> +CONFIG_NOUVEAU_DEBUG=5
> +CONFIG_NOUVEAU_DEBUG_DEFAULT=3
> +# CONFIG_NOUVEAU_DEBUG_MMU is not set
> +CONFIG_DRM_NOUVEAU_BACKLIGHT=y
> +# CONFIG_DRM_NOUVEAU_SVM is not set
>  CONFIG_FB=y
>  CONFIG_FB_OF=y
>  CONFIG_FB_CONTROL=y
> @@ -206,8 +213,6 @@ CONFIG_FB_PLATINUM=y
>  CONFIG_FB_VALKYRIE=y
>  CONFIG_FB_CT65550=y
>  CONFIG_FB_IMSTT=y
> -CONFIG_FB_NVIDIA=y
> -CONFIG_FB_NVIDIA_I2C=y
>  CONFIG_FB_MATROX=y
>  CONFIG_FB_MATROX_MILLENIUM=y
>  CONFIG_FB_MATROX_MYSTIQUE=y
> diff --git a/arch/powerpc/configs/ppc6xx_defconfig
> b/arch/powerpc/configs/ppc6xx_defconfig
> index feb5d47d8d1e..48421f5007ed 100644
> --- a/arch/powerpc/configs/ppc6xx_defconfig
> +++ b/arch/powerpc/configs/ppc6xx_defconfig
> @@ -738,15 +738,19 @@ CONFIG_DRM_MGA=m
>  CONFIG_DRM_SIS=m
>  CONFIG_DRM_VIA=m
>  CONFIG_DRM_SAVAGE=m
> +CONFIG_DRM_NOUVEAU=m
> +# CONFIG_NOUVEAU_LEGACY_CTX_SUPPORT is not set
> +CONFIG_NOUVEAU_DEBUG=5
> +CONFIG_NOUVEAU_DEBUG_DEFAULT=3
> +# CONFIG_NOUVEAU_DEBUG_MMU is not set
> +CONFIG_DRM_NOUVEAU_BACKLIGHT=y
> +# CONFIG_DRM_NOUVEAU_SVM is not set
>  CONFIG_FB=y
>  CONFIG_FB_CIRRUS=m
>  CONFIG_FB_OF=y
>  CONFIG_FB_PLATINUM=y
>  CONFIG_FB_VALKYRIE=y
>  CONFIG_FB_CT65550=y
> -CONFIG_FB_NVIDIA=y
> -CONFIG_FB_NVIDIA_I2C=y
> -CONFIG_FB_RIVA=m
>  CONFIG_FB_MATROX=y
>  CONFIG_FB_MATROX_MILLENIUM=y
>  CONFIG_FB_MATROX_MYSTIQUE=y

Re: [PATCH 19/28] gpu/drm: remove the powerpc hack in drm_legacy_sg_alloc

2020-04-09 Thread Benjamin Herrenschmidt

On Thu, 2020-04-09 at 11:41 +0200, Daniel Vetter wrote:
> Now if these boxes didn't ever have agp then I think we can get away
> with deleting this, since we've already deleted the legacy radeon
> driver. And that one used vmalloc for everything. The new kms one does
> use the dma-api if the gpu isn't connected through agp

Definitely no AGP there.

Cheers
Ben.

Re: [PATCH 19/28] gpu/drm: remove the powerpc hack in drm_legacy_sg_alloc

2020-04-09 Thread Benjamin Herrenschmidt

On Wed, 2020-04-08 at 14:25 +0200, Daniel Vetter wrote:
> On Wed, Apr 08, 2020 at 01:59:17PM +0200, Christoph Hellwig wrote:
> > If this code was broken for non-coherent caches a crude powerpc hack
> > isn't going to help anyone else.  Remove the hack as it is the last
> > user of __vmalloc passing a page protection flag other than PAGE_KERNEL.
> 
> Well Ben added this to make stuff work on ppc, ofc the home grown dma
> layer in drm from back then isn't going to work in other places. I guess
> should have at least an ack from him, in case anyone still cares about
> this on ppc. Adding Ben to cc.

This was due to some drivers (radeon ?) trying to use vmalloc pages for
coherent DMA, which means on those 4xx powerpc's need to be non-cached.

There were machines using that (440 based iirc), though I honestly
can't tell if anybody still uses any of it.

Cheers,
Ben.

> -Daniel
> 
> > 
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  drivers/gpu/drm/drm_scatter.c | 11 +--
> >  1 file changed, 1 insertion(+), 10 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/drm_scatter.c b/drivers/gpu/drm/drm_scatter.c
> > index ca520028b2cb..f4e6184d1877 100644
> > --- a/drivers/gpu/drm/drm_scatter.c
> > +++ b/drivers/gpu/drm/drm_scatter.c
> > @@ -43,15 +43,6 @@
> >  
> >  #define DEBUG_SCATTER 0
> >  
> > -static inline void *drm_vmalloc_dma(unsigned long size)
> > -{
> > -#if defined(__powerpc__) && defined(CONFIG_NOT_COHERENT_CACHE)
> > -   return __vmalloc(size, GFP_KERNEL, pgprot_noncached_wc(PAGE_KERNEL));
> > -#else
> > -   return vmalloc_32(size);
> > -#endif
> > -}
> > -
> >  static void drm_sg_cleanup(struct drm_sg_mem * entry)
> >  {
> > struct page *page;
> > @@ -126,7 +117,7 @@ int drm_legacy_sg_alloc(struct drm_device *dev, void 
> > *data,
> > return -ENOMEM;
> > }
> >  
> > -   entry->virtual = drm_vmalloc_dma(pages << PAGE_SHIFT);
> > +   entry->virtual = vmalloc_32(pages << PAGE_SHIFT);
> > if (!entry->virtual) {
> > kfree(entry->busaddr);
> > kfree(entry->pagelist);
> > -- 
> > 2.25.1
> > 
> 
>

Re: [PATCH 0/2] powerpc: Remove support for ppc405/440 Xilinx platforms

2020-04-07 Thread Benjamin Herrenschmidt

On Fri, 2020-04-03 at 15:59 +1100, Michael Ellerman wrote:
> Benjamin Herrenschmidt  writes:
> > On Tue, 2020-03-31 at 16:30 +1100, Michael Ellerman wrote:
> > > I have no attachment to 40x, and I'd certainly be happy to have
> > > less
> > > code in the tree, we struggle to keep even the modern platforms
> > > well
> > > maintained.
> > > 
> > > At the same time I don't want to render anyone's hardware
> > > obsolete
> > > unnecessarily. But if there's really no one using 40x then we
> > > should
> > > remove it, it could well be broken already.
> > > 
> > > So I guess post a series to do the removal and we'll see if
> > > anyone
> > > speaks up.
> > 
> > We shouldn't remove 40x completely. Just remove the Xilinx 405
> > stuff.
> 
> Congratulations on becoming the 40x maintainer!

Didn't I give you my last 40x system ? :-) IBM still put 40x cores
inside POWER chips no ?

Cheers,
Ben.

Re: [PATCH v4 03/25] powerpc/powernv: Map & release OpenCAPI LPC memory

2020-04-02 Thread Benjamin Herrenschmidt

On Wed, 2020-04-01 at 01:48 -0700, Dan Williams wrote:
> > 
> > +u64 pnv_ocxl_platform_lpc_setup(struct pci_dev *pdev, u64 size)
> > +{
> > +   struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> > +   struct pnv_phb *phb = hose->private_data;
> 
> Is calling the local variable 'hose' instead of 'host' on purpose?

Haha that's funny :-)

It's an ld usage that comes iirc from sparc ? or maybe alpha ?
I somewhat accidentally picked it up when adding multiple host-bridge
support on powerpc in the early 2000's and it hasn't quite died yet :)

Cheers,
Ben.

Re: [PATCH 0/2] powerpc: Remove support for ppc405/440 Xilinx platforms

2020-04-02 Thread Benjamin Herrenschmidt

On Tue, 2020-03-31 at 16:30 +1100, Michael Ellerman wrote:
> I have no attachment to 40x, and I'd certainly be happy to have less
> code in the tree, we struggle to keep even the modern platforms well
> maintained.
> 
> At the same time I don't want to render anyone's hardware obsolete
> unnecessarily. But if there's really no one using 40x then we should
> remove it, it could well be broken already.
> 
> So I guess post a series to do the removal and we'll see if anyone
> speaks up.

We shouldn't remove 40x completely. Just remove the Xilinx 405 stuff.

Cheers,
Ben.

Re: [PATCH] dt: Remove booting-without-of.txt

2020-03-04 Thread Benjamin Herrenschmidt

On Wed, 2020-03-04 at 12:45 -0600, Rob Herring wrote:
> Well, not quite removed yet... Mauro is looking at moving this to ReST,
> but I think it would be better to trim or remove it.
> 
> boot-without-of.txt is an ancient document that first outlined
> Flattened DeviceTree. The DT world has evolved a lot in the 15 years
> since and boot-without-of.txt is pretty stale. The name of the document
> itself is confusing if you don't understand the evolution from real
> 'OpenFirmware'. Much of what booting-without-of.txt contains is now in
> the DT specification (which evolved out of the ePAPR).
> 
> This is a first pass of removing everything that has a DT spec
> equivalent or is no longer standard practice (e.g. soc for SoC
> nodes) in order to see what's left. This is what I have:
> 
> TODO
> - Move boot interface details to arch specific docs
> - Document 'serial-number' property in DT spec
> - Document the 'hotpluggable' memory property in DT spec
> - Document the 'sleep' property (PPC only)
> - Document the 'dma-coherent' property in DT spec
> - Need the history of node names and 'name' property?
> - Need how addresses work?
> 
> Cc: Frank Rowand 
> Cc: Mauro Carvalho Chehab 

Acked-by: Benjamin Herrenschmidt 

> Cc: Geert Uytterhoeven 
> Cc: Michael Ellerman 
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Rob Herring 
> ---
>  .../devicetree/booting-without-of.txt | 1027 +
>  1 file changed, 1 insertion(+), 1026 deletions(-)
> 
> diff --git a/Documentation/devicetree/booting-without-of.txt 
> b/Documentation/devicetree/booting-without-of.txt
> index 4660ccee35a3..97beee828ba4 100644
> --- a/Documentation/devicetree/booting-without-of.txt
> +++ b/Documentation/devicetree/booting-without-of.txt
> @@ -19,44 +19,17 @@ Table of Contents
>  5) Entry point for arch/sh
>  
>II - The DT block format
> -1) Header
>  2) Device tree generalities
> -3) Device tree "structure" block
> -4) Device tree "strings" block
>  
>III - Required content of the device tree
>  1) Note about cells and address representation
> -2) Note about "compatible" properties
> -3) Note about "name" properties
> -4) Note about node and property names and character set
>  5) Required nodes and properties
>a) The root node
> -  b) The /cpus node
> -  c) The /cpus/* nodes
> -  d) the /memory node(s)
> -  e) The /chosen node
> -  f) the /soc node
> -
> -  IV - "dtc", the device tree compiler
> -
> -  V - Recommendations for a bootloader
> -
> -  VI - System-on-a-chip devices and nodes
> -1) Defining child nodes of an SOC
> -2) Representing devices without a current OF specification
> -
> -  VII - Specifying interrupt information for devices
> -1) interrupts property
> -2) interrupt-parent property
> -3) OpenPIC Interrupt Controllers
> -4) ISA Interrupt Controllers
>  
>VIII - Specifying device power management information (sleep property)
>  
>IX - Specifying dma bus information
>  
> -  Appendix A - Sample SOC node for MPC8540
> -
>  
>  Revision Information
>  
> @@ -105,19 +78,6 @@ Revision Information
>- Added chapter VI
>  
>  
> - ToDo:
> - - Add some definitions of interrupt tree (simple/complex)
> - - Add some definitions for PCI host bridges
> - - Add some common address format examples
> - - Add definitions for standard properties and "compatible"
> -   names for cells that are not already defined by the existing
> -   OF spec.
> - - Compare FSL SOC use of PCI to standard and make sure no new
> -   node definition required.
> - - Add more information about node definitions for SOC devices
> -   that currently have no standard, like the FSL CPM.
> -
> -
>  I - Introduction
>  
>  
> @@ -333,196 +293,17 @@ II - The DT block format
>  
>  
>  
> -This chapter defines the actual format of the flattened device-tree
> -passed to the kernel. The actual content of it and kernel requirements
> -are described later. You can find example of code manipulating that
> -format in various places, including arch/powerpc/kernel/prom_init.c
> -which will generate a flattened device-tree from the Open Firmware
> -representation, or the fs2dt utility which is part of the kexec tools
> -which will generate one from a filesystem representation. It is
> -expected that a bootloader like uboot provides a bit more support,
> -that will be discussed later as well.
> -
>  Note: The block has

Re: vdso function descriptors (VDS64_HAS_DESCRIPTORS)?

2020-02-24 Thread Benjamin Herrenschmidt

On Mon, 2020-02-24 at 10:20 -0500, Joe Lawrence wrote:
> On
> > I don't remember why :-) I think I didn't want to mess with the OPD
> > fixup in glibc back then.
> > 
> 
> Does it make sense to just drop the unused VDS64_HAS_DESCRIPTORS code
> then?

I'd think so yes.

Cheers,
Ben.

Re: vdso function descriptors (VDS64_HAS_DESCRIPTORS)?

2020-02-24 Thread Benjamin Herrenschmidt

On Sat, 2020-02-22 at 18:07 -0600, Segher Boessenkool wrote:
> > 
> > so I don't believe they are ever used by default -- in this case
> > V_FUNCTION_BEGIN doesn't add to the .opd section with .name, .TOC base,
> > etc.
> > 
> > Manually setting VDS64_HAS_DESCRIPTORS results in a vdso64.so in which
> > binutils tools like readelf properly report functions with symbol type
> > FUNC instead of NOTYPE.
> > 
> > Are there pieces of the build/etc toolchain unprepared for function
> > descriptors?  I'm just trying to figure out why the code defaults to
> > unsetting them.
> 
> Because direct calls are faster than indirect calls?  Ben might have a
> fuller explanation, cc:ing him.

I don't remember why :-) I think I didn't want to mess with the OPD
fixup in glibc back then.

Ben.

Re: Onboard SD card doesn't work anymore after the 'mmc-v5.4-2' updates

2019-10-28 Thread Benjamin Herrenschmidt

On Fri, 2019-10-25 at 23:39 -0700, Christoph Hellwig wrote:
> On Fri, Oct 25, 2019 at 05:28:45PM -0500, Rob Herring wrote:
> > This doesn't work?:
> > 
> > if (IS_ENABLED(CONFIG_PPC) || of_dma_is_coherent(dev-
> > >of_node))
> > value |= ESDHC_DMA_SNOOP;
> > else
> > value &= ~ESDHC_DMA_SNOOP;
> > 
> > While I said use the compatibles, using the kconfig symbol is
> > easier
> > than sorting out which compatibles are PPC SoCs. Though if that's
> > already done elsewhere in the driver, you could set a flag and use
> > that here. I'd be surprised if this was the only difference between
> > ARM and PPC SoCs for this block.
> 
> I think the right thing is a Kconfig variable that the architectures
> selects which says if OF is by default coherent or incoherent, and
> then use that in of_dma_is_coherent.

That too. We could also define properties for both ways so we can
override the default either way.

Cheers,
Ben.

Re: Onboard SD card doesn't work anymore after the 'mmc-v5.4-2' updates

2019-10-28 Thread Benjamin Herrenschmidt

On Fri, 2019-10-25 at 17:28 -0500, Rob Herring wrote:
> This doesn't work?:
> 
> if (IS_ENABLED(CONFIG_PPC) || of_dma_is_coherent(dev-
> >of_node))
> value |= ESDHC_DMA_SNOOP;
> else
> value &= ~ESDHC_DMA_SNOOP;

CONFIG_PPC is restrictive. What about sparc64 ? There could be others
.. .we can't suddenly requests people to add new properties for what
was implied behaviours before hand, esp since it's not in the base 1275
spec, no ?

I would suggest of_dma_is_coherent is *true* by default unless
overriden to do something else.

Cheers,
Ben.

Re: Onboard SD card doesn't work anymore after the 'mmc-v5.4-2' updates

2019-10-23 Thread Benjamin Herrenschmidt

On Wed, 2019-10-23 at 16:42 +1100, Michael Ellerman wrote:
> 
> Right, it seems of_dma_is_coherent() has baked in the assumption that
> devices are non-coherent unless explicitly marked as coherent.
> 
> Which is wrong on all or at least most existing powerpc systems
> according to Ben.

This is probably broken on sparc(64) as well and whatever else uses
DT and is an intrinsicly coherent architecture (did we ever have
DT enabled x86s ? Wasn't OLPC such a beast ?).

I think this should have been done the other way around and default to
coherent since most traditional OF platforms are coherent, and you
can't just require those DTs to change.

> > Any ideas from the PPC maintainers?
> 
> Fixing it at the source seems like the best option to prevent future
> breakage.
> 
> So I guess that would mean making of_dma_is_coherent() return true/false
> based on CONFIG_NOT_COHERENT_CACHE on powerpc.
> 
> We could do it like below, which would still allow the dma-coherent
> property to work if it ever makes sense on a future powerpc platform.
> 
> I don't really know any of this embedded stuff well, so happy to take
> other suggestions on how to handle this mess.

Re: [PATCH] powerpc/mm/book3s64/hash: Update 4k PAGE_SIZE kernel mapping

2019-10-16 Thread Benjamin Herrenschmidt

On Wed, 2019-10-16 at 20:30 +1100, Michael Ellerman wrote:
> I think the main reason is that in some configurations we can't use 64K
> pages for MMIO, so the 64K vmalloc mappings and 4K MMIO mappings need to
> be in different segments.
> 
> It's possible that's no longer true on modern configs. But annoyingly I
> can't easily tell because the kernel doesn't know, it defers to firmware
> telling it via the "ibm,pa-features" property and setting
> MMU_FTR_CI_LARGE_PAGE.
> 
> The spec (PAPR) doesn't give us any hints either it just says the bit in
> the device tree means:
> 
>   The value of 1 indicates support for I=1 (cache inhibited) large
>   pages; else not supported.
> 
> Which at least suggests that firmware is allowed to support 64K MMIO
> mappings or not at its whim.
> 
> Hopefully Ben or Paul remember more details than me.

Not 100% certain but I think the HW only started supporting non-4k MMIO
pages with P7 (or was it some variant of P6 ?)

Also it's unclear under what circumstances the proprietary hypervisor
or KVM will allow it. I know at some point KVM didn't...

Cheers,
Ben.

Re: Linux PowerPC, PPC4xx

2019-10-16 Thread Benjamin Herrenschmidt

On Wed, 2019-10-16 at 15:19 +0200, Carlo Pisani wrote:
> hi
> I am a student, I represent a group of friends running a couple of
> opensource projects(1), and we are lost with this(2) problem.
> 
> I wrote here(2) a couple of years ago, we are still working with
> kernel
> 4.11.0 and there is broken support for initializing the PCI.
> 
> The PCI initialization of the PPC405GP seems wrong and every
> kernel >= 2.6.26 is not able to correctly address the PDC20265
> 
> an interesting note is:
> kernel 2.6.26 can be compiled with arch=ppc and arch=powerpc
> 
> when compiled with arch=ppc the promise PDC20265 chip is correctly
> managed; while when compiled with arch=powerpc the PDC20265 is not
> correctly managed
> 
> any idea? advice? help? suggestion?
> a good place to discuss it?

On powermac we have a quirk to force these controllers into native
mode. You can try that.

Look for pmac_pci_fixup_pciata(). You could copy that to
powerpc/pci_32.c along with

DECLARE_PCI_FIXUP_EARLY(PCI_ANY_ID, PCI_ANY_ID, pmac_pci_fixup_pciata);

(maybe rename pmac to ppc) and remove the line that checks for
machine_is(powermac)

Cheers,
Ben.

Re: [PATCH] powerpc/time: use feature fixup in __USE_RTC() instead of cpu feature.

2019-08-26 Thread Benjamin Herrenschmidt

On Mon, 2019-08-26 at 21:41 +1000, Michael Ellerman wrote:
> Christophe Leroy  writes:
> > sched_clock(), used by printk(), calls __USE_RTC() to know
> > whether to use realtime clock or timebase.
> > 
> > __USE_RTC() uses cpu_has_feature() which is initialised by
> > machine_init(). Before machine_init(), __USE_RTC() returns true,
> > leading to a program check exception on CPUs not having realtime
> > clock.
> > 
> > In order to be able to use printk() earlier, use feature fixup.
> > Feature fixups are applies in early_init(), enabling the use of
> > printk() earlier.
> > 
> > Signed-off-by: Christophe Leroy 
> > ---
> >  arch/powerpc/include/asm/time.h | 9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> The other option would be just to make this a compile time decision, eg.
> add CONFIG_PPC_601 and use that to gate whether we use RTC.
> 
> Given how many 601 users there are, maybe 1?, I think that would be a
> simpler option and avoids complicating the code / binary for everyone
> else.

Didn't we ditch 601 support years ago anyway ? We had workaround we
threw out I think...

Cheers,
Ben.

> cheers
> 
> > diff --git a/arch/powerpc/include/asm/time.h 
> > b/arch/powerpc/include/asm/time.h
> > index 54f4ec1f9fab..3455cb54c333 100644
> > --- a/arch/powerpc/include/asm/time.h
> > +++ b/arch/powerpc/include/asm/time.h
> > @@ -42,7 +42,14 @@ struct div_result {
> >  /* Accessor functions for the timebase (RTC on 601) registers. */
> >  /* If one day CONFIG_POWER is added just define __USE_RTC as 1 */
> >  #ifdef CONFIG_PPC_BOOK3S_32
> > -#define __USE_RTC()(cpu_has_feature(CPU_FTR_USE_RTC))
> > +static inline bool __USE_RTC(void)
> > +{
> > +   asm_volatile_goto(ASM_FTR_IFCLR("nop;", "b %1;", %0) ::
> > + "i" (CPU_FTR_USE_RTC) :: l_use_rtc);
> > +   return false;
> > +l_use_rtc:
> > +   return true;
> > +}
> >  #else
> >  #define __USE_RTC()0
> >  #endif
> > -- 
> > 2.13.3

Re: [PATCH] powerpc/vdso32: inline __get_datapage()

2019-08-20 Thread Benjamin Herrenschmidt

On Fri, 2019-08-16 at 14:48 +, Christophe Leroy wrote:
> __get_datapage() is only a few instructions to retrieve the
> address of the page where the kernel stores data to the VDSO.
> 
> By inlining this function into its users, a bl/blr pair and
> a mflr/mtlr pair is avoided, plus a few reg moves.
> 
> The improvement is noticeable (about 55 nsec/call on an 8xx)

Interesting... would be worth doing the same on vdso64 no ?

> vdsotest before the patch:
> gettimeofday:vdso: 731 nsec/call
> clock-gettime-realtime-coarse:vdso: 668 nsec/call
> clock-gettime-monotonic-coarse:vdso: 745 nsec/call
> 
> vdsotest after the patch:
> gettimeofday:vdso: 677 nsec/call
> clock-gettime-realtime-coarse:vdso: 613 nsec/call
> clock-gettime-monotonic-coarse:vdso: 690 nsec/call
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/vdso32/cacheflush.S   | 10 +-
>  arch/powerpc/kernel/vdso32/datapage.S | 29 -
> 
>  arch/powerpc/kernel/vdso32/datapage.h | 12 
>  arch/powerpc/kernel/vdso32/gettimeofday.S | 11 +--
>  4 files changed, 26 insertions(+), 36 deletions(-)
>  create mode 100644 arch/powerpc/kernel/vdso32/datapage.h
> 
> diff --git a/arch/powerpc/kernel/vdso32/cacheflush.S
> b/arch/powerpc/kernel/vdso32/cacheflush.S
> index 7f882e7b9f43..e9453837e4ee 100644
> --- a/arch/powerpc/kernel/vdso32/cacheflush.S
> +++ b/arch/powerpc/kernel/vdso32/cacheflush.S
> @@ -10,6 +10,8 @@
>  #include 
>  #include 
>  
> +#include "datapage.h"
> +
>   .text
>  
>  /*
> @@ -24,14 +26,12 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>.cfi_startproc
>   mflrr12
>.cfi_register lr,r12
> - mr  r11,r3
> - bl  __get_datapage@local
> + get_datapager10, r0
>   mtlrr12
> - mr  r10,r3
>  
>   lwz r7,CFG_DCACHE_BLOCKSZ(r10)
>   addir5,r7,-1
> - andcr6,r11,r5   /* round low to line bdy */
> + andcr6,r3,r5/* round low to line bdy */
>   subfr8,r6,r4/* compute length */
>   add r8,r8,r5/* ensure we get enough */
>   lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10)
> @@ -48,7 +48,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>  
>   lwz r7,CFG_ICACHE_BLOCKSZ(r10)
>   addir5,r7,-1
> - andcr6,r11,r5   /* round low to line bdy */
> + andcr6,r3,r5/* round low to line bdy */
>   subfr8,r6,r4/* compute length */
>   add r8,r8,r5
>   lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10)
> diff --git a/arch/powerpc/kernel/vdso32/datapage.S
> b/arch/powerpc/kernel/vdso32/datapage.S
> index 6984125b9fc0..d480d2d4a3fe 100644
> --- a/arch/powerpc/kernel/vdso32/datapage.S
> +++ b/arch/powerpc/kernel/vdso32/datapage.S
> @@ -11,34 +11,13 @@
>  #include 
>  #include 
>  
> +#include "datapage.h"
> +
>   .text
>   .global __kernel_datapage_offset;
>  __kernel_datapage_offset:
>   .long   0
>  
> -V_FUNCTION_BEGIN(__get_datapage)
> -  .cfi_startproc
> - /* We don't want that exposed or overridable as we want other
> objects
> -  * to be able to bl directly to here
> -  */
> - .protected __get_datapage
> - .hidden __get_datapage
> -
> - mflrr0
> -  .cfi_register lr,r0
> -
> - bcl 20,31,data_page_branch
> -data_page_branch:
> - mflrr3
> - mtlrr0
> - addir3, r3, __kernel_datapage_offset-data_page_branch
> - lwz r0,0(r3)
> -  .cfi_restore lr
> - add r3,r0,r3
> - blr
> -  .cfi_endproc
> -V_FUNCTION_END(__get_datapage)
> -
>  /*
>   * void *__kernel_get_syscall_map(unsigned int *syscall_count) ;
>   *
> @@ -53,7 +32,7 @@ V_FUNCTION_BEGIN(__kernel_get_syscall_map)
>   mflrr12
>.cfi_register lr,r12
>   mr  r4,r3
> - bl  __get_datapage@local
> + get_datapager3, r0
>   mtlrr12
>   addir3,r3,CFG_SYSCALL_MAP32
>   cmpli   cr0,r4,0
> @@ -74,7 +53,7 @@ V_FUNCTION_BEGIN(__kernel_get_tbfreq)
>.cfi_startproc
>   mflrr12
>.cfi_register lr,r12
> - bl  __get_datapage@local
> + get_datapager3, r0
>   lwz r4,(CFG_TB_TICKS_PER_SEC + 4)(r3)
>   lwz r3,CFG_TB_TICKS_PER_SEC(r3)
>   mtlrr12
> diff --git a/arch/powerpc/kernel/vdso32/datapage.h
> b/arch/powerpc/kernel/vdso32/datapage.h
> new file mode 100644
> index ..ad96256be090
> --- /dev/null
> +++ b/arch/powerpc/kernel/vdso32/datapage.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +.macro get_datapage ptr, tmp
> + bcl 20,31,888f
> +888:
> + mflr\ptr
> + addi\ptr, \ptr, __kernel_datapage_offset - 888b
> + lwz \tmp, 0(\ptr)
> + add \ptr, \tmp, \ptr
> +.endm
> +
> +
> diff --git a/arch/powerpc/kernel/vdso32/gettimeofday.S
> b/arch/powerpc/kernel/vdso32/gettimeofday.S
> index e10098cde89c..91a58f01dcd5 100644
> ---

Re: [PATCH v2 05/12] powerpc/mm: rework io-workaround invocation.

2019-08-20 Thread Benjamin Herrenschmidt

On Wed, 2019-08-21 at 00:28 +0200, Christoph Hellwig wrote:
> On Tue, Aug 20, 2019 at 02:07:13PM +, Christophe Leroy wrote:
> > ppc_md.ioremap() is only used for I/O workaround on CELL platform,
> > so indirect function call can be avoided.
> > 
> > This patch reworks the io-workaround and ioremap() functions to
> > use the global 'io_workaround_inited' flag for the activation
> > of io-workaround.
> > 
> > When CONFIG_PPC_IO_WORKAROUNDS or CONFIG_PPC_INDIRECT_MMIO are not
> > selected, the I/O workaround ioremap() voids and the global flag is
> > not used.
> 
> Note that CONFIG_PPC_IO_WORKAROUNDS is only selected by a specific cell
> config,  and CONFIG_PPC_INDIRECT_MMIO is always selected by cell, so
> I think we can make CONFIG_PPC_IO_WORKAROUNDS depend on
> CONFIG_PPC_INDIRECT_MMIO

Or we can deprecate that old platform... not sure anybody uses it
anymore (if anybody ever did).

Cheers,
ben.

Re: [PATCH] drivers/macintosh/smu.c: Mark expected switch fall-through

2019-07-30 Thread Benjamin Herrenschmidt

On Tue, 2019-07-30 at 10:07 -0700, Kees Cook wrote:
> 
> > Why do we think it's an expected fall through? I can't really
> > convince
> > myself from the surrounding code that it's definitely intentional.
> 
> Yeah, good question. Just now when I went looking for who
> used SMU_I2C_TRANSFER_COMBINED, I found the only caller in
> arch/powerpc/platforms/powermac/low_i2c.c and it is clearly using a
> fall-through for building the command for "stdsub" and "combined",
> so I think that's justification enough:

Yes, sorry for the delay, the fall through is intentional.

Cheers,
Ben.

Re: [PATCH kernel v3] powerpc/xive: Drop deregistered irqs

2019-07-16 Thread Benjamin Herrenschmidt

On Wed, 2019-07-17 at 15:00 +1000, Alexey Kardashevskiy wrote:
> There is a race between releasing an irq on one cpu and fetching it
> from XIVE on another cpu as there does not seem to be any locking between
> these, probably because xive_irq_chip::irq_shutdown() is supposed to
> remove the irq from all queues in the system which it does not do.
> 
> As a result, when such released irq appears in a queue, we take it
> from the queue but we do not change the current priority on that cpu and
> since there is no handler for the irq, EOI is never called and the cpu
> current priority remains elevated (7 vs. 0xff==unmasked). If another irq
> is assigned to the same cpu, then that device stops working until irq
> is moved to another cpu or the device is reset.
> 
> This adds a new ppc_md.orphan_irq callback which is called if no irq
> descriptor is found. The XIVE implementation drops the current priority
> to 0xff which effectively unmasks interrupts in a current CPU.

Better.

Now, you should proably add orphan_irq as a separate patch, and it
wouldn't hurt to make other PICs like XICS also provide it :-) They are
less likely to hit due to the absence of queuing but I suppose the
theorical race exists.

Cheers,
Ben.

> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: Cédric Le Goater 
> ---
> Changes:
> v3:
> * added a comment above xive_orphan_irq()
> 
> v2:
> * added ppc_md.orphan_irq
> ---
>  arch/powerpc/include/asm/machdep.h |  3 +++
>  arch/powerpc/kernel/irq.c  |  9 ++---
>  arch/powerpc/sysdev/xive/common.c  | 18 ++
>  3 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h 
> b/arch/powerpc/include/asm/machdep.h
> index c43d6eca9edd..6cc14e28e89a 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -59,6 +59,9 @@ struct machdep_calls {
>   /* Return an irq, or 0 to indicate there are none pending. */
>   unsigned int(*get_irq)(void);
>  
> + /* Drops irq if it does not have a valid descriptor */
> + void(*orphan_irq)(unsigned int irq);
> +
>   /* PCI stuff */
>   /* Called after allocating resources */
>   void(*pcibios_fixup)(void);
> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index bc68c53af67c..b4e06d05bdba 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -632,10 +632,13 @@ void __do_irq(struct pt_regs *regs)
>   may_hard_irq_enable();
>  
>   /* And finally process it */
> - if (unlikely(!irq))
> + if (unlikely(!irq)) {
>   __this_cpu_inc(irq_stat.spurious_irqs);
> - else
> - generic_handle_irq(irq);
> + } else if (generic_handle_irq(irq)) {
> + if (ppc_md.orphan_irq)
> + ppc_md.orphan_irq(irq);
> + __this_cpu_inc(irq_stat.spurious_irqs);
> + }
>  
>   trace_irq_exit(regs);
>  
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 082c7e1c20f0..17e696b2d71b 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -283,6 +283,23 @@ static unsigned int xive_get_irq(void)
>   return irq;
>  }
>  
> +/*
> + * Handles the case when a target CPU catches an interrupt which is being 
> shut
> + * down on another CPU. generic_handle_irq() returns an error in such case
> + * and then the orphan_irq() handler restores the CPPR to reenable 
> interrupts.
> + *
> + * Without orphan_irq() and valid irq_desc, there is no other way to restore
> + * the CPPR. This executes on a CPU which caught the interrupt.
> + */
> +static void xive_orphan_irq(unsigned int irq)
> +{
> + struct xive_cpu *xc = __this_cpu_read(xive_cpu);
> +
> + xc->cppr = 0xff;
> + out_8(xive_tima + xive_tima_offset + TM_CPPR, 0xff);
> + DBG_VERBOSE("orphan_irq: irq %d, adjusting CPPR to 0xff\n", irq);
> +}
> +
>  /*
>   * After EOI'ing an interrupt, we need to re-check the queue
>   * to see if another interrupt is pending since multiple
> @@ -1419,6 +1436,7 @@ bool __init xive_core_init(const struct xive_ops *ops, 
> void __iomem *area, u32 o
>   xive_irq_priority = max_prio;
>  
>   ppc_md.get_irq = xive_get_irq;
> + ppc_md.orphan_irq = xive_orphan_irq;
>   __xive_enabled = true;
>  
>   pr_devel("Initializing host..\n");

Re: [RFC PATCH] virtio_ring: Use DMA API if guest memory is encrypted

2019-07-15 Thread Benjamin Herrenschmidt

On Mon, 2019-07-15 at 19:03 -0300, Thiago Jung Bauermann wrote:
> > > Indeed. The idea is that QEMU can offer the flag, old guests can
> > > reject
> > > it (or even new guests can reject it, if they decide not to
> > > convert into
> > > secure VMs) and the feature negotiation will succeed with the
> > > flag
> > > unset.
> > 
> > OK. And then what does QEMU do? Assume guest is not encrypted I
> > guess?
> 
> There's nothing different that QEMU needs to do, with or without the
> flag. the perspective of the host, a secure guest and a regular guest
> work the same way with respect to virtio.

This is *precisely* why I was against adding a flag and touch the
protocol negociation with qemu in the first place, back when I cared
about that stuff...

Guys, this has gone in circles over and over again.

This has nothing to do with qemu. Qemu doesn't need to know about this.
It's entirely guest local. This is why the one-liner in virtio was a
far better and simpler solution.

This is something the guest does to itself (with the participation of a
ultravisor but that's not something qemu cares about at this stage, at
least not as far as virtio is concerned).

Basically, the guest "hides" its memory from the host using a HW secure
memory facility. As a result, it needs to ensure that all of its DMA
pages are bounced through insecure pages that aren't hidden. That's it,
it's all guest side. Qemu shouldn't have to care about it at all.

Cheers,
Ben.

Re: [RFC PATCH kernel] powerpc/xive: Drop deregistered irqs

2019-07-14 Thread Benjamin Herrenschmidt

On Sun, 2019-07-14 at 21:44 +0200, Cédric Le Goater wrote:
> > Well, best is probably to do just that though, but call it something
> > like ppc_md.orphan_irq() or something like that instead. Another option
> > as you mention is to try to scrub queues, but that's trickier to do due
> > to the lockless nature of the queue handling.
> 
> When the IRQ is shutdown, couldn't we cleanup the CPU EQ by filtering 
> all the dangling entries, and replacing them with zeroes ? That would
> be alternative 1, but I don't think we need to scan all cpus. The last
> target should be enough.

It's a bit tricky due to the lockless nature of the queues...

Cheers,
Ben.

Re: [RFC PATCH kernel] powerpc/xive: Drop deregistered irqs

2019-07-13 Thread Benjamin Herrenschmidt

On Sat, 2019-07-13 at 18:53 +1000, Alexey Kardashevskiy wrote:
> 
> On 13/07/2019 09:47, Benjamin Herrenschmidt wrote:
> > On Fri, 2019-07-12 at 19:37 +1000, Alexey Kardashevskiy wrote:
> > > 
> > 
> > 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/irq.c#n614
> > > 
> > > If so, then in order to do EOI, I'll need the desc which is gone,
> > > or
> > > I am missing the point?
> > 
> > All you need is drop the local CPU priority.
> 
> I know that, the question was how. I cannot use irq_chip in
> arch/powerpc/kernel/irq.c and I do not want to add
> ppc_md.enable_irqs().

Well, best is probably to do just that though, but call it something
like ppc_md.orphan_irq() or something like that instead. Another option
as you mention is to try to scrub queues, but that's trickier to do due
to the lockless nature of the queue handling.

Cheers,
Ben.

Re: [RFC PATCH kernel] powerpc/xive: Drop deregistered irqs

2019-07-12 Thread Benjamin Herrenschmidt

On Fri, 2019-07-12 at 19:37 +1000, Alexey Kardashevskiy wrote:
> 

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/kernel/irq.c#n614
> 
> If so, then in order to do EOI, I'll need the desc which is gone, or
> I am missing the point?

All you need is drop the local CPU priority.

Cheers,
Ben.

Re: [RFC PATCH kernel] powerpc/xive: Drop deregistered irqs

2019-07-12 Thread Benjamin Herrenschmidt

On Fri, 2019-07-12 at 18:20 +1000, Alexey Kardashevskiy wrote:
> 
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index 082c7e1c20f0..65742e280337 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -148,8 +148,12 @@ static u32 xive_scan_interrupts(struct xive_cpu *xc, 
> bool just_peek)
>   irq = xive_read_eq(>queue[prio], just_peek);
>  
>   /* Found something ? That's it */
> - if (irq)
> - break;
> + if (irq) {
> + /* Another CPU may have shut this irq down, check it */
> + if (irq_to_desc(irq))

What if it gets deregistered here  ?

> + break;
> + irq = 0;
> + }
>  
>   /* Clear pending bits */
>   xc->pending_prio &= ~(1 << prio);

Wouldn't it be better to check the return value from generic_handle_irq
instead ?

Cheers,
Ben.

Re: [PATCH] powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac

2019-06-19 Thread Benjamin Herrenschmidt

On Wed, 2019-06-19 at 22:32 +1000, Michael Ellerman wrote:
> Christoph Hellwig  writes:
> > Any chance this could get picked up to fix the regression?
> 
> Was hoping Ben would Ack it. He's still powermac maintainer :)
> 
> I guess he OK'ed it in the other thread, will add it to my queue.

Yeah ack. If I had written it myself, I would have made the DMA bits a
variable and only set it down to 30 if I see that device in the DT
early on, but I can't be bothered now, if it works, ship it :-)

Note: The patch affects all ppc32, though I don't think it will cause
any significant issue on those who don't need it.

Cheers,
Ben.

> cheers
> 
> > On Thu, Jun 13, 2019 at 10:24:46AM +0200, Christoph Hellwig wrote:
> > > With the strict dma mask checking introduced with the switch to
> > > the generic DMA direct code common wifi chips on 32-bit
> > > powerbooks
> > > stopped working.  Add a 30-bit ZONE_DMA to the 32-bit pmac builds
> > > to allow them to reliably allocate dma coherent memory.
> > > 
> > > Fixes: 65a21b71f948 ("powerpc/dma: remove
> > > dma_nommu_dma_supported")
> > > Reported-by: Aaro Koskinen 
> > > Signed-off-by: Christoph Hellwig 
> > > ---
> > >  arch/powerpc/include/asm/page.h | 7 +++
> > >  arch/powerpc/mm/mem.c   | 3 ++-
> > >  arch/powerpc/platforms/powermac/Kconfig | 1 +
> > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/page.h
> > > b/arch/powerpc/include/asm/page.h
> > > index b8286a2013b4..0d52f57fca04 100644
> > > --- a/arch/powerpc/include/asm/page.h
> > > +++ b/arch/powerpc/include/asm/page.h
> > > @@ -319,6 +319,13 @@ struct vm_area_struct;
> > >  #endif /* __ASSEMBLY__ */
> > >  #include 
> > >  
> > > +/*
> > > + * Allow 30-bit DMA for very limited Broadcom wifi chips on many
> > > powerbooks.
> > > + */
> > > +#ifdef CONFIG_PPC32
> > > +#define ARCH_ZONE_DMA_BITS 30
> > > +#else
> > >  #define ARCH_ZONE_DMA_BITS 31
> > > +#endif
> > >  
> > >  #endif /* _ASM_POWERPC_PAGE_H */
> > > diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> > > index cba29131bccc..2540d3b2588c 100644
> > > --- a/arch/powerpc/mm/mem.c
> > > +++ b/arch/powerpc/mm/mem.c
> > > @@ -248,7 +248,8 @@ void __init paging_init(void)
> > >  (long int)((top_of_ram - total_ram) >> 20));
> > >  
> > >  #ifdef CONFIG_ZONE_DMA
> > > - max_zone_pfns[ZONE_DMA] = min(max_low_pfn, 0x7fffUL
> > > >> PAGE_SHIFT);
> > > + max_zone_pfns[ZONE_DMA] = min(max_low_pfn,
> > > + ((1UL << ARCH_ZONE_DMA_BITS) - 1) >>
> > > PAGE_SHIFT);
> > >  #endif
> > >   max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
> > >  #ifdef CONFIG_HIGHMEM
> > > diff --git a/arch/powerpc/platforms/powermac/Kconfig
> > > b/arch/powerpc/platforms/powermac/Kconfig
> > > index f834a19ed772..c02d8c503b29 100644
> > > --- a/arch/powerpc/platforms/powermac/Kconfig
> > > +++ b/arch/powerpc/platforms/powermac/Kconfig
> > > @@ -7,6 +7,7 @@ config PPC_PMAC
> > >   select PPC_INDIRECT_PCI if PPC32
> > >   select PPC_MPC106 if PPC32
> > >   select PPC_NATIVE
> > > + select ZONE_DMA if PPC32
> > >   default y
> > >  
> > >  config PPC_PMAC64
> > > -- 
> > > 2.20.1
> > 
> > ---end quoted text---

Re: [PATCH kernel] powerpc/pci/of: Parse unassigned resources

2019-06-18 Thread Benjamin Herrenschmidt

On Tue, 2019-06-18 at 22:15 +1000, Michael Ellerman wrote:
> Alexey Kardashevskiy  writes:
> > The pseries platform uses the PCI_PROBE_DEVTREE method of PCI probing
> > which is basically reading "assigned-addresses" of every PCI device.
> > However if the property is missing or zero sized, then there is
> > no fallback of any kind and the PCI resources remain undiscovered, i.e.
> > pdev->resource[] array is empty.
> > 
> > This adds a fallback which parses the "reg" property in pretty much same
> > way except it marks resources as "unset" which later makes Linux assign
> > those resources with proper addresses.
> 
> What happens under PowerVM is the big question.
> 
> ie. if we see such a device under PowerVM and then do our own assignment
> what happens?

May or may not work ... EEH will be probably b0rked, but then it
shouldn't happen.

Basically PowerVM itself doesn't do anything special with PCI. It maps
a whole PHB (or virtual PHB) into the guest and doesn't care much
beyond that for MMIOs.

What you see in Linux getting in the way is RTAS. It's the one
assigning BAR values etc... within that region setup by the HV, but
RTAS is running in the guest, from the HV perspective it's all the same
really.

So if such a device did exist, RTAS would lose track but it would still
work from a HW/HV perspective. RTAS-driven services such as EEH would
probably fail though.

But in practice this shouldn't happen bcs RTAS will set assigned-
addresses on everything.

Cheers,
Ben.

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-12 Thread Benjamin Herrenschmidt

On Wed, 2019-06-12 at 14:41 -0500, Larry Finger wrote:
> On 6/12/19 1:55 AM, Christoph Hellwig wrote:
> > 
> > Ooops, yes.  But I think we could just enable ZONE_DMA on 32-bit
> > powerpc.  Crude enablement hack below:
> > 
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index 8c1c636308c8..1dd71a98b70c 100644
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -372,7 +372,7 @@ config PPC_ADV_DEBUG_DAC_RANGE
> >
> >config ZONE_DMA
> >bool
> > - default y if PPC_BOOK3E_64
> > + default y
> >
> >config PGTABLE_LEVELS
> >int
> > 
> 
> With the patch for Kconfig above, and the original patch setting 
> ARCH_ZONE_DMA_BITS to 30, everything works.
> 
> Do you have any ideas on what should trigger the change in ARCH_ZONE_BITS? 
> Should it be CONFIG_PPC32 defined, or perhaps CONFIG_G4_CPU defined?

I think CONFIG_PPC32 is fine

Ben.

Re: [PATCH v2 8/8] habanalabs: enable 64-bit DMA mask in POWER9

2019-06-12 Thread Benjamin Herrenschmidt

On Wed, 2019-06-12 at 09:25 +0300, Oded Gabbay wrote:
> 
> > You can't. Your device is broken. Devices that don't support DMAing to
> > the full 64-bit deserve to be added to the trash pile.
> > 
> 
> Hmm... right know they are added to customers data-centers but what do I know 
> ;)

Well, some customers don't know they are being sold a lemon :)

> > As a result, getting it to work will require hacks. Some GPUs have
> > similar issues and require similar hacks, it's unfortunate.
> > 
> > Added a couple of guys on CC who might be able to help get those hacks
> > right.
> 
> Thanks :)
> > 
> > It's still very fishy .. the idea is to detect the case where setting a
> > 64-bit mask will give your system memory mapped at a fixed high address
> > (1 << 59 in our case) and program that in your chip in the "Fixed high
> > bits" register that you seem to have (also make sure it doesn't affect
> > MSIs or it will break them).
> 
> MSI-X are working. The set of bit 59 doesn't apply to MSI-X
> transactions (AFAICS from the PCIe controller spec we have).

Ok.

> > This will only work as long as all of the system memory can be
> > addressed at an offset from that fixed address that itself fits your
> > device addressing capabilities (50 bits in this case). It may or may
> > not be the case but there's no way to check since the DMA mask logic
> > won't really apply.
> 
> Understood. In the specific system we are integrated to, that is the
> case - we have less then 48 bits. But, as you pointed out, it is not a
> generic solution but with my H/W I can't give a generic fit-all
> solution for POWER9. I'll settle for the best that I can do.
> 
> > 
> > You might want to consider fixing your HW in the next iteration... This
> > is going to bite you when x86 increases the max physical memory for
> > example, or on other architectures.
> 
> Understood and taken care of.

Cheers,
Ben.

> > 
> > Cheers,
> > Ben.
> > 
> > 
> > 
> >

Re: [PATCH v2 8/8] habanalabs: enable 64-bit DMA mask in POWER9

2019-06-12 Thread Benjamin Herrenschmidt

On Wed, 2019-06-12 at 15:45 +1000, Oliver O'Halloran wrote:
> 
> Also, are you sure about the MSI thing? The IODA3 spec says the only
> important bits for a 64bit MSI are bits 61:60 (to hit the window) and
> the lower bits that determine what IVE to use. Everything in between
> is ignored so ORing in bit 59 shouldn't break anything.

On IODA3... could be different on another system. My point is you can't
just have a fixed setting for all top bits for DMA & MSIs.

> > This will only work as long as all of the system memory can be
> > addressed at an offset from that fixed address that itself fits your
> > device addressing capabilities (50 bits in this case). It may or may
> > not be the case but there's no way to check since the DMA mask logic
> > won't really apply.
> > 
> > You might want to consider fixing your HW in the next iteration... This
> > is going to bite you when x86 increases the max physical memory for
> > example, or on other architectures.
> 
> Yes, do this. The easiest way to avoid this sort of wierd hack is to
> just design the PCIe interface to the spec in the first place.

Ben.

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-11 Thread Benjamin Herrenschmidt

On Tue, 2019-06-11 at 20:52 -0500, Larry Finger wrote:
> On 6/11/19 5:46 PM, Benjamin Herrenschmidt wrote:
> > On Tue, 2019-06-11 at 17:20 -0500, Larry Finger wrote:
> > > b43-pci-bridge 0001:11:00.0: dma_direct_supported: failed (mask =
> > > 0x3fff,
> > > min_mask = 0x5000/0x5000, dma bits = 0x1f
> > 
> > Ugh ? A mask with holes in it ? That's very wrong... That min_mask is
> > bogus.
> 
> I agree, but that is not likely serious as most systems will have enough 
> memory 
> that the max_pfn term will be much larger than the initial min_mask, and 
> min_mask will be unchanged by the min function. 

Well no... it's too much memory that is the problem. If min_mask is
bogus though it will cause problem later too, so one should look into
it.

> In addition, min_mask is not 
> used beyond this routine, and then only to decide if direct dma is supported. 
> The following patch generates masks with no holes, but I cannot see that it 
> is 
> needed.

The right fix is to round up max_pfn to a power of 2, something like

min_mask = min_t(u64, min_mask, (roundup_pow_of_two(max_pfn - 1)) <<
PAGE_SHIFT) 

> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 2c2772e9702a..e3edd4f29e80 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -384,7 +384,8 @@ int dma_direct_supported(struct device *dev, u64 mask)
>  else
>  min_mask = DMA_BIT_MASK(32);
> 
> -   min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);
> +   min_mask = min_t(u64, min_mask, ((max_pfn - 1) << PAGE_SHIFT) |
> +DMA_BIT_MASK(PAGE_SHIFT));
> 
>  /*
>   * This check needs to be against the actual bit mask value, so
> 
> 
> Larry

Re: [PATCH v2 8/8] habanalabs: enable 64-bit DMA mask in POWER9

2019-06-11 Thread Benjamin Herrenschmidt

On Tue, 2019-06-11 at 20:22 +0300, Oded Gabbay wrote:
> 
> > So, to summarize:
> > If I call pci_set_dma_mask with 48, then it fails on POWER9. However,
> > in runtime, I don't know if its POWER9 or not, so upon failure I will
> > call it again with 32, which makes our device pretty much unusable.
> > If I call pci_set_dma_mask with 64, and do the dedicated configuration
> > in Goya's PCIe controller, then it won't work on x86-64, because bit
> > 59 will be set and the host won't like it (I checked it). In addition,
> > I might get addresses above 50 bits, which my device can't generate.
> > 
> > I hope this makes things more clear. Now, please explain to me how I
> > can call pci_set_dma_mask without any regard to whether I run on
> > x86-64 or POWER9, considering what I wrote above ?
> > 
> > Thanks,
> > Oded
> 
> Adding ppc mailing list.

You can't. Your device is broken. Devices that don't support DMAing to
the full 64-bit deserve to be added to the trash pile.

As a result, getting it to work will require hacks. Some GPUs have
similar issues and require similar hacks, it's unfortunate.

Added a couple of guys on CC who might be able to help get those hacks
right.

It's still very fishy .. the idea is to detect the case where setting a
64-bit mask will give your system memory mapped at a fixed high address
(1 << 59 in our case) and program that in your chip in the "Fixed high
bits" register that you seem to have (also make sure it doesn't affect
MSIs or it will break them).

This will only work as long as all of the system memory can be
addressed at an offset from that fixed address that itself fits your
device addressing capabilities (50 bits in this case). It may or may
not be the case but there's no way to check since the DMA mask logic
won't really apply.

You might want to consider fixing your HW in the next iteration... This
is going to bite you when x86 increases the max physical memory for
example, or on other architectures.

Cheers,
Ben.

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-11 Thread Benjamin Herrenschmidt

On Tue, 2019-06-11 at 17:20 -0500, Larry Finger wrote:
> b43-pci-bridge 0001:11:00.0: dma_direct_supported: failed (mask =
> 0x3fff, 
> min_mask = 0x5000/0x5000, dma bits = 0x1f

Ugh ? A mask with holes in it ? That's very wrong... That min_mask is
bogus.

Ben.

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-11 Thread Benjamin Herrenschmidt

On Tue, 2019-06-11 at 09:54 +0200, Christoph Hellwig wrote:
> On Tue, Jun 11, 2019 at 04:59:54PM +1000, Benjamin Herrenschmidt
> wrote:
> > Ah stupid me ... it's dma_set_mask that failed, since it has no
> > idea
> > that the calling driver is limited to lowmem.
> > 
> > That's also why the "wrong" patch worked.
> > 
> > So yes, a ZONE_DMA at 30-bits will work, though it's somewhat
> > overkill.
> 
> Well, according to Larry it doesn't actually work, which is odd.

Oh I assume that's just a glitch in the patch :-)

Cheers,
Ben.

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-11 Thread Benjamin Herrenschmidt

On Tue, 2019-06-11 at 16:58 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2019-06-11 at 08:08 +0200, Christoph Hellwig wrote:
> > On Tue, Jun 11, 2019 at 03:56:33PM +1000, Benjamin Herrenschmidt
> > wrote:
> > > The reason I think it sort-of-mostly-worked is that to get more
> > > than
> > > 1GB of RAM, those machines use CONFIG_HIGHMEM. And *most* network
> > > buffers aren't allocated in Highmem so you got lucky.
> > > 
> > > That said, there is such as thing as no-copy send on network, so I
> > > wouldn't be surprised if some things would still have failed, just
> > > not
> > > frequent enough for you to notice.
> > 
> > Unless NETIF_F_HIGHDMA is set on a netdev, the core networkign code
> > will bounce buffer highmem pages for the driver under all
> > circumstances.
> 
>  ... which b43legacy doesn't set to the best of my knowledge ...
> 
> Which makes me wonder how come it didn't work even with your patches ?
> AFAIK, we have less than 1GB of lowmem unless the config has been
> tweaked

Ah stupid me ... it's dma_set_mask that failed, since it has no idea
that the calling driver is limited to lowmem.

That's also why the "wrong" patch worked.

So yes, a ZONE_DMA at 30-bits will work, though it's somewhat overkill.

Cheers,
Ben.

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-11 Thread Benjamin Herrenschmidt

On Tue, 2019-06-11 at 08:08 +0200, Christoph Hellwig wrote:
> On Tue, Jun 11, 2019 at 03:56:33PM +1000, Benjamin Herrenschmidt
> wrote:
> > The reason I think it sort-of-mostly-worked is that to get more
> > than
> > 1GB of RAM, those machines use CONFIG_HIGHMEM. And *most* network
> > buffers aren't allocated in Highmem so you got lucky.
> > 
> > That said, there is such as thing as no-copy send on network, so I
> > wouldn't be surprised if some things would still have failed, just
> > not
> > frequent enough for you to notice.
> 
> Unless NETIF_F_HIGHDMA is set on a netdev, the core networkign code
> will bounce buffer highmem pages for the driver under all
> circumstances.

 ... which b43legacy doesn't set to the best of my knowledge ...

Which makes me wonder how come it didn't work even with your patches ?
AFAIK, we have less than 1GB of lowmem unless the config has been
tweaked

Cheers,
Ben.

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-10 Thread Benjamin Herrenschmidt

On Mon, 2019-06-10 at 13:44 -0500, Larry Finger wrote:
> On 6/7/19 11:21 PM, Benjamin Herrenschmidt wrote:
> > 
> > > Please try the attached patch. I'm not really pleased with it and I will
> > > continue to determine why the fallback to a 30-bit mask fails, but at 
> > > least this
> > > one works for me.
> > 
> > Your patch only makes sense if the device is indeed capable of
> > addressing 31-bits.
> > 
> > So either the driver is buggy and asks for a too small mask in which
> > case your patch is ok, or it's not and you're just going to cause all
> > sort of interesting random problems including possible memory
> > corruption.
> 
> Of course the driver may be buggy, but it asks for the correct mask.
> 
> This particular device is not capable of handling 32-bit DMA. The driver 
> detects 
> the 32-bit failure and falls back to 30 bits. It works on x86, and did on 
> PPC32 
> until 5.1. As Christoph said, it should always be possible to use fewer bits 
> than the maximum.

No, I don't think it *worked* on ppc32 before Christoph patch. I think
it "mostly sort-of worked" :-)

The reason I'm saying that is if your system has more than 1GB of RAM,
then you'll have chunks of memory that the device simply cannot
address.

Before Christoph patches, we had no ZONE_DMA or ZONE_DMA32 covering the
30-bit limited space, so any memory allocation could in theory land
above 30-bits, causing all sort of horrible things to happen with that
driver.

The reason I think it sort-of-mostly-worked is that to get more than
1GB of RAM, those machines use CONFIG_HIGHMEM. And *most* network
buffers aren't allocated in Highmem so you got lucky.

That said, there is such as thing as no-copy send on network, so I
wouldn't be surprised if some things would still have failed, just not
frequent enough for you to notice.

> Similar devices that are new enough to use b43 rather than b43legacy work 
> with 
> new kernels; however, they have and use 32-bit DMA.

Cheres,
Ben.

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-07 Thread Benjamin Herrenschmidt



> Please try the attached patch. I'm not really pleased with it and I will 
> continue to determine why the fallback to a 30-bit mask fails, but at least 
> this 
> one works for me.

Your patch only makes sense if the device is indeed capable of
addressing 31-bits.

So either the driver is buggy and asks for a too small mask in which
case your patch is ok, or it's not and you're just going to cause all
sort of interesting random problems including possible memory
corruption.

Cheers,
Ben.

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-06 Thread Benjamin Herrenschmidt

On Thu, 2019-06-06 at 20:56 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2019-06-06 at 12:31 +0300, Aaro Koskinen wrote:
> > Hi,
> > 
> > On Thu, Jun 06, 2019 at 10:54:51AM +1000, Benjamin Herrenschmidt
> > wrote:
> > > On Thu, 2019-06-06 at 01:50 +0300, Aaro Koskinen wrote:
> > > > Hi,
> > > > 
> > > > When upgrading from v5.0 -> v5.1 on G4 PowerBook, I noticed WLAN
> > > > does
> > > > not work anymore:
> > > > 
> > > > [   42.004303] b43legacy-phy0: Loading firmware version 0x127,
> > > > patch level 14 (2005-04-18 02:36:27)
> > > > [   42.184837] b43legacy-phy0 debug: Chip initialized
> > > > [   42.184873] b43legacy-phy0 ERROR: The machine/kernel does not
> > > > support the required 30-bit DMA mask
> > > > 
> > > > The same happens with the current mainline.
> > > 
> > > How much RAM do you have ?
> > 
> > The system has 1129 MB RAM. Booting with mem=1G makes it work.
> 
> Wow... that's an odd amount. One thing we could possibly do is add code
> to limit the amount of RAM when we detect that device

Sent too quickly... I mean that *or* force swiotlb at 30-bits on those systems 
based
on detecting the presence of that device in the device-tree.

Cheers,
Ben.

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-06 Thread Benjamin Herrenschmidt

On Thu, 2019-06-06 at 12:31 +0300, Aaro Koskinen wrote:
> Hi,
> 
> On Thu, Jun 06, 2019 at 10:54:51AM +1000, Benjamin Herrenschmidt
> wrote:
> > On Thu, 2019-06-06 at 01:50 +0300, Aaro Koskinen wrote:
> > > Hi,
> > > 
> > > When upgrading from v5.0 -> v5.1 on G4 PowerBook, I noticed WLAN
> > > does
> > > not work anymore:
> > > 
> > > [   42.004303] b43legacy-phy0: Loading firmware version 0x127,
> > > patch level 14 (2005-04-18 02:36:27)
> > > [   42.184837] b43legacy-phy0 debug: Chip initialized
> > > [   42.184873] b43legacy-phy0 ERROR: The machine/kernel does not
> > > support the required 30-bit DMA mask
> > > 
> > > The same happens with the current mainline.
> > 
> > How much RAM do you have ?
> 
> The system has 1129 MB RAM. Booting with mem=1G makes it work.

Wow... that's an odd amount. One thing we could possibly do is add code
to limit the amount of RAM when we detect that device

Cheers,
Ben.

Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-05 Thread Benjamin Herrenschmidt

On Thu, 2019-06-06 at 01:50 +0300, Aaro Koskinen wrote:
> Hi,
> 
> When upgrading from v5.0 -> v5.1 on G4 PowerBook, I noticed WLAN does
> not work anymore:
> 
> [   42.004303] b43legacy-phy0: Loading firmware version 0x127, patch level 14 
> (2005-04-18 02:36:27)
> [   42.184837] b43legacy-phy0 debug: Chip initialized
> [   42.184873] b43legacy-phy0 ERROR: The machine/kernel does not support the 
> required 30-bit DMA mask
> 
> The same happens with the current mainline.

How much RAM do you have ?

Ben.

> 
> Bisected to:
> 
>   commit 65a21b71f948406201e4f62e41f06513350ca390
>   Author: Christoph Hellwig 
>   Date:   Wed Feb 13 08:01:26 2019 +0100
> 
>   powerpc/dma: remove dma_nommu_dma_supported
> 
>   This function is largely identical to the generic version used
>   everywhere else.  Replace it with the generic version.
> 
>   Signed-off-by: Christoph Hellwig 
>   Tested-by: Christian Zigotzky 
>   Signed-off-by: Michael Ellerman 
> 
> A.

Re: [PATCH kernel] prom_init: Fetch flatten device tree from the system firmware

2019-06-03 Thread Benjamin Herrenschmidt

On Mon, 2019-06-03 at 18:42 -0500, Segher Boessenkool wrote:
> Hi!
> 
> On Mon, Jun 03, 2019 at 12:56:26PM +1000, Alexey Kardashevskiy wrote:
> > On 03/06/2019 09:23, Segher Boessenkool wrote:
> > > > So I go for the simple one and agree with Alexey's idea.
> > > 
> > > When dealing with a whole device tree you have to know about the
> > > various
> > > dynamically generated nodes and props, and handle each
> > > appropriately.
> > 
> > The code I am changing fetches the device tree and build an fdt.
> > What is
> > that special knowledge in this context you are talking about?
> 
> Things like /options are dynamically generated.

They are generated before we do the call to extract the fdt, what's the
problem there ?

> I thought you would just be able to reuse some FDT parsing code to
> implement my suggested new interface.  Maybe that was too optimistic.

No, the idea is to have SLOF re-flatten it's live tree and hand us a
blob. At least that's my understanding.

Ben.

Re: [PATCH kernel] prom_init: Fetch flatten device tree from the system firmware

2019-06-03 Thread Benjamin Herrenschmidt

On Mon, 2019-06-03 at 12:56 +1000, Alexey Kardashevskiy wrote:
> 
> > That is all you need if you do not want to use OF at all.
> 
> ? We also need OF drivers to boot grub and the system, and a default
> console for early booting, but yes, we do not want to keep using slof
> that much.
> 
> > If you *do* want to keep having an Open Firmware, what we want or need
> > is a faster way to walk huge device trees.
> 
> Why? We do not need to _walk_ the tree at all (we _have_ to now and
> while walking we do nothing but pack everything together into the fdt
> blob) as slof can easily do this already.

I agree with Alexey. In a sense this can be thought as an extension of
"quiesce", which effectively kills OF.

Yes we could make property fetching faster but mostly by creating a new
bulk interface which is quite a bit of work, a new API, and will in
practice not be used for anything other than creating the FDT. In that
case, nothing will beat in performance having OF create the FDT itself
based on its current tree.

> > > There is no use for the "fetch all properties" cases other than
> > > building an FDT that any of us can think of, and it would create a more
> > > complicated interface than just "fetch an FDT".
> > 
> > It is a simple way to speed up fetching the device tree enormously,
> > without needing big changes to either OF or the clients using it -- not
> > in the code, but importantly also not conceptually: everything works just
> > as before, just a lot faster.
> 
> I can safely presume though that this will never ever be used in
> practice. And it will be still slower, a tiny bit. And require new code
> in both slof and linux.

Correct.

> I can rather see us getting rid of SLOF in the future which in turn will
> require the fdt blob pointer in r5 (or whatever it is, forgot) when the
> guest starts; so "fetch-all-props" won't be used there either.
> 
> > > So I go for the simple one and agree with Alexey's idea.
> > 
> > When dealing with a whole device tree you have to know about the various
> > dynamically generated nodes and props, and handle each appropriately.
> 
> The code I am changing fetches the device tree and build an fdt. What is
> that special knowledge in this context you are talking about?

Ben.

Re: [PATCH kernel] prom_init: Fetch flatten device tree from the system firmware

2019-05-30 Thread Benjamin Herrenschmidt

On Thu, 2019-05-30 at 14:37 -0500, Segher Boessenkool wrote:
> On Thu, May 30, 2019 at 05:09:06PM +1000, Alexey Kardashevskiy wrote:
> > so, it is sort-of nack from David and sort-of ack from Segher, what
> > happens now?
> 
> Maybe what we really need just a CI call to get all properties of a node
> at once?  Will that speed up things enough?
> 
> That way you need no change at all in lifetime of properties and how they
> are used, etc.; just a client getting the properties is a lot faster.

Hrm... if we're going to create a new interface, let's go for what we
need.

What we need is the FDT. It's a rather ubiquitous thing these days, it
makes sense to have a way to fetch an FDT directly from FW.

There is no use for the "fetch all properties" cases other than
building an FDT that any of us can think of, and it would create a more
complicated interface than just "fetch an FDT".

So I go for the simple one and agree with Alexey's idea.

Cheers,
Ben.

Re: [PATCH 1/2] [PowerPC] Add simd.h implementation

2019-05-13 Thread Benjamin Herrenschmidt

On Mon, 2019-05-13 at 22:44 -0300, Shawn Landden wrote:
> +
> +/*
> + * Were we in user mode when we were
> + * interrupted?
> + *
> + * Doing kernel_altivec/vsx_begin/end() is ok if we are running
> + * in an interrupt context from user mode - we'll just
> + * save the FPU state as required.
> + */
> +static bool interrupted_user_mode(void)
> +{
> +   struct pt_regs *regs = get_irq_regs();
> +
> +   return regs && user_mode(regs);
> +}
> +

That's interesting  it *could* work but we'll have to careful audit
the code to make sure thats ok.

We probably also want to handle the case where the CPU is in the idle
loop.

Do we always save the user state when switching out these days ? If
yes, then there's no "live" state to worry about...

Cheers,
Ben.

Re: [QUESTION] powerpc, libseccomp, and spu

2019-02-11 Thread Benjamin Herrenschmidt

On Mon, 2019-02-11 at 11:54 -0700, Tom Hromatka wrote:
> PowerPC experts,
> 
> Paul Moore and I are working on the v2.4 release of libseccomp,
> and as part of this work I need to update the syscall table for
> each architecture.
> 
> I have incorporated the new ppc syscall.tbl into libseccomp, but
> I am not familiar with the value of "spu" in the ABI column.  For
> example:
> 
> 2232  umount  sys_oldumount
> 2264  umount  sys_ni_syscall
> 22spu umount  sys_ni_syscall
> 
> In libseccomp, we maintain a 32-bit ppc syscall table and a 64-bit
> ppc syscall table.  Do we also need to add a "spu" ppc syscall
> table?  Some clarification on the syscalls marked "spu" and "nospu"
> would be greatly appreciated.

On the Cell processor, there is a number of little co-processors (SPUs)
that run alongside the main PowerPC core. Userspace can run code on
them, they operate within the user context via their own MMUs. We
provide a facility for them to issue syscalls (via some kind of RPC to
the main core). The "SPU" indication indicates syscalls that can be
called from the SPUs via that mechanism.

Now, the big question is, anybody still using Cell ? :-)

Cheers,
Ben.

Re: [QUESTION] powerpc, libseccomp, and spu

2019-02-11 Thread Benjamin Herrenschmidt

On Mon, 2019-02-11 at 11:54 -0700, Tom Hromatka wrote:
> PowerPC experts,
> 
> Paul Moore and I are working on the v2.4 release of libseccomp,
> and as part of this work I need to update the syscall table for
> each architecture.
> 
> I have incorporated the new ppc syscall.tbl into libseccomp, but
> I am not familiar with the value of "spu" in the ABI column.  For
> example:
> 
> 2232  umount  sys_oldumount
> 2264  umount  sys_ni_syscall
> 22spu umount  sys_ni_syscall
> 
> In libseccomp, we maintain a 32-bit ppc syscall table and a 64-bit
> ppc syscall table.  Do we also need to add a "spu" ppc syscall
> table?  Some clarification on the syscalls marked "spu" and "nospu"
> would be greatly appreciated.

On the Cell processor, there is a number of little co-processors (SPUs)
that run alongside the main PowerPC core. Userspace can run code on
them, they operate within the user context via their own MMUs. We
provide a facility for them to issue syscalls (via some kind of RPC to
the main core). The "SPU" indication indicates syscalls that can be
called from the SPUs via that mechanism.

Now, the big question is, anybody still using Cell ? :-)

Cheers,
Ben.

Re: [PATCH v1 03/16] powerpc/32: move LOAD_MSR_KERNEL() into head_32.h and use it

2019-02-11 Thread Benjamin Herrenschmidt

On Mon, 2019-02-11 at 07:26 +0100, Christophe Leroy wrote:
> 
> Le 11/02/2019 à 01:21, Benjamin Herrenschmidt a écrit :
> > On Fri, 2019-02-08 at 12:52 +, Christophe Leroy wrote:
> > >   /*
> > > + * MSR_KERNEL is > 0x8000 on 4xx/Book-E since it include MSR_CE.
> > > + */
> > > +.macro __LOAD_MSR_KERNEL r, x
> > > +.if \x >= 0x8000
> > > +   lis \r, (\x)@h
> > > +   ori \r, \r, (\x)@l
> > > +.else
> > > +   li \r, (\x)
> > > +.endif
> > > +.endm
> > > +#define LOAD_MSR_KERNEL(r, x) __LOAD_MSR_KERNEL r, x
> > > +
> > 
> > You changed the limit from >= 0x1 to >= 0x8000 without a
> > corresponding explanation as to why...
> 
> Yes, the existing LOAD_MSR_KERNEL() was buggy because 'li' takes a 
> signed u16, ie between -0x8000 and 0x7999.

Ah yes, I was only looking at the "large" case which is fine...

> By chance it was working because until now nobody was trying to set 
> MSR_KERNEL | MSR_EE.
> 
> Christophe

Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device

2019-02-10 Thread Benjamin Herrenschmidt

On Mon, 2019-02-11 at 13:38 +1100, David Gibson wrote:
> 
> 1) All in kernel
> 
> The offset always maps directly to guest irq number and the kernel
> somehow binds it either to an IPI or a host irq as necessary.
> Cédric's original code attempts this, but the mechanism of keeping a
> pointer to the VMA can't work.

Why do you need a pointer to the VMA anyway ? unmap_mapping_range()
doesn't need a VMA for the unmap part, and faults/mmaps have the VMA.

> But.. remapping the irqs should be sufficiently infrequent that it
> might be ok to consider simply stepping through all the hosting
> process's VMAs to do this.

Which unmap_mapping_range() does for you as I explained previously. You
only need the address space. See how spufs does it (among others).

> 2) Remapped in qemu (using memory regions)
> 
> I _think_ (in hindsight) was Cédric's been discussing as the
> alternative in more recent posts.
> 
> Qemu maps the IPI pages at one place and the passthrough IRQ pages
> somewhere else.  The IPIs are mapped into the guest as one memory
> region, then any passthrough IRQ pages are mapped over that using
> overlapping memory regions.
> 
> I don't think this approach will work well, because it could require a
> bunch of separate KVM memory slots, which are fairly scarce.
> 
> 3) Remapped in qemu (using mmap())
> 
> This is the approach I (and I think Paul) have been suggested in
> contrast to (1).
> 
> Qemu maps the IPI pages and maps those into the guest.  When we need
> to set up a passthrough IRQ, qemu mmap()s its pages directly over the
> IPI pages, and it remains mapped into the guest with the same memory
> region / memslot as the IPIs are already using.  If the passthrough
> device is removed we have to remap the IPI pages back into place.
> 
> 4) Dedicated irq numbers
> 
> We never re-use regular guest irq numbers for passthrough irqs,
> instead we put them somewhere else and keep those mapped to the
> passthrough irq pages.
> 
> I was favouring this approach, but it does mean there will be a guest
> visible difference between kernel_irqchip=on and off which isn't
> great.
> 
> 
> (1) is the most elegant _interface_, but as we've seen it's
> problematic to implement.  Looking at the for_all_vmas() approach
> could be interesting, but otherwise option (3) might be the most
> practical.
> 
> --

Re: [PATCH] powerpc: fix 32-bit KVM-PR lockup and panic with MacOS guest

2019-02-10 Thread Benjamin Herrenschmidt

On Fri, 2019-02-08 at 14:51 +, Mark Cave-Ayland wrote:
> 
> Indeed, but there are still some questions to be asked here:
> 
> 1) Why were these bits removed from the original bitmask in the first place 
> without
> it being documented in the commit message?
> 
> 2) Is this the right fix? I'm told that MacOS guests already run without this 
> patch
> on a G5 under 64-bit KVM-PR which may suggest that this is a workaround for 
> another
> bug elsewhere in the 32-bit powerpc code.
> 
> 
> If you think that these points don't matter, then I'm happy to resubmit the 
> patch
> as-is based upon your comments above.

We should write a test case to verify that FE0/FE1 are properly
preserved/context-switched etc... I bet if we accidentally wiped them,
we wouldn't notice 99.9% of the time.

Cheers,
Ben.

Re: [PATCH v1 03/16] powerpc/32: move LOAD_MSR_KERNEL() into head_32.h and use it

2019-02-10 Thread Benjamin Herrenschmidt

On Fri, 2019-02-08 at 12:52 +, Christophe Leroy wrote:
>  /*
> + * MSR_KERNEL is > 0x8000 on 4xx/Book-E since it include MSR_CE.
> + */
> +.macro __LOAD_MSR_KERNEL r, x
> +.if \x >= 0x8000
> +   lis \r, (\x)@h
> +   ori \r, \r, (\x)@l
> +.else
> +   li \r, (\x)
> +.endif
> +.endm
> +#define LOAD_MSR_KERNEL(r, x) __LOAD_MSR_KERNEL r, x
> +

You changed the limit from >= 0x1 to >= 0x8000 without a
corresponding explanation as to why...

Ben.

Re: [RFC/WIP] powerpc: Fix 32-bit handling of MSR_EE on exceptions

2019-02-05 Thread Benjamin Herrenschmidt

On Tue, 2019-02-05 at 10:45 +0100, Christophe Leroy wrote:
> > > I tested it on the 8xx with the below changes in addition. No issue seen
> > > so far.
> > 
> > Thanks !
> > 
> > I'll merge that in.
> 
> I'm currently working on a refactorisation and simplification of
> exception and syscall entry on ppc32.
> 
> I plan to take your patch in my serie as it helps quite a bit. I hope 
> you don't mind. I expect to come out with a series this week.

Ah ok, you want to take over the series then ? We still need to convert
all the other CPU variants... to be honest I've been distracted, and
taking some time off. I'll be leaving IBM by the end of next week, so I
don't really see myself finishing this work properly.

> > The main obscure area is that business with the irqsoff tracer and thus
> > the need to create stack frames around calls to trace_hardirqs_* ... we
> > do it in some places and not others, but I've not managed to make it
> > crash either. I need to get to the bottom of that, and possibly provide
> > proper macro helpers like ppc64 has to do it.
> 
> I can't see anything special around this in ppc32 code. As far as I 
> understand, a stack frame is put in place when there is a need to
> save and restore some volatile registers. At the places where nothing 
> needs to be saved, nothing is done. I think that's the normal way for 
> any function call, isn't it ?

Not exactly. There's an issue with one of the tracers using
__bultin_return_address(1) which can crash afaik if we don't have
"enough" stack frames on the stack, so there are cases where we need to
create one explicitly around the tracing calls bcs there's only one on
the actual stack.

I don't know the full details, I was planning on doing a bunch of tests
in sim to figure out exactly what happens and what needs to be done
(and whether our existing code is correct or not), but didn't get to it
so far.

Cheers,
Ben.

Re: Does SMP work at all on 40x ?

2019-01-31 Thread Benjamin Herrenschmidt

On Wed, 2019-01-30 at 08:16 +0100, Christophe Leroy wrote:
> In transfer_to_handler() (entry_32.S), we have:
> 
> #if defined(CONFIG_40x) || defined(CONFIG_BOOKE)
> ...
> #ifdef CONFIG_SMP
>   CURRENT_THREAD_INFO(r9, r1)
>   lwz r9,TI_CPU(r9)
>   slwir9,r9,3
>   add r11,r11,r9
> #endif
> #endif
> 
> When running this piece of code, MMU translation is off. But r9 contains 
> the virtual addr of current_thread_info, so unless I miss something, 
> this cannot work on the 40x, can it ?
> 
> On CONFIG_BOOKE it works because phys addr = virt addr

There is no 40x SMP that I am aware of.

Cheers,
Ben.

Re: [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls

2019-01-25 Thread Benjamin Herrenschmidt

On Tue, 2019-01-22 at 16:23 +1100, Paul Mackerras wrote:
> 
> Which ones of these could be implemented in QEMU?  Are there any that
> can't possibly be implemented in QEMU because they need to do things
> that require calling internal interfaces that userspace doesn't have
> access to?

Implementing them in qemu doesn't make a lot of sense. Qemu doesn't
have access to most of the XIVE HW state. There's a XIVE model for full
emulation but when using the real thing, almost none of it is visible.
A lot of those hcalls effectively turn into OPAL calls.

> How often do we expect each of these hypercalls to be called?

It depends, I can't tell for AIX. For Linux, not often with one
exception: H_INT_ESB which will be used whenever you EOI an emulated
LSI.

 .../...

> Why do we need to provide real-mode versions of these hypercall
> handlers?  I thought these hypercalls would only get called
> infrequently, and in any case certainly much less frequently than once
> per interrupt delivered.  If they are infrequent, then let's leave out
> the real-mode version and just handle them in book3s_hv.c.

Agreed with the exception maybe of H_INT_ESB

> > @@ -5153,6 +5169,19 @@ static unsigned int default_hcall_list[] = {
> >H_IPOLL,
> >H_XIRR,
> >H_XIRR_X,
> > +#endif
> > +#ifdef CONFIG_KVM_XIVE
> > + H_INT_GET_SOURCE_INFO,
> > + H_INT_SET_SOURCE_CONFIG,
> > + H_INT_GET_SOURCE_CONFIG,
> > + H_INT_GET_QUEUE_INFO,
> > + H_INT_SET_QUEUE_CONFIG,
> > + H_INT_GET_QUEUE_CONFIG,
> > + H_INT_SET_OS_REPORTING_LINE,
> > + H_INT_GET_OS_REPORTING_LINE,
> > + H_INT_ESB,
> > + H_INT_SYNC,
> > + H_INT_RESET,
> >   #endif
> 
> The policy is not to add new hcalls to default_hcall_list[].  Is there
> a strong reason for adding them here?
> 
> Paul.

Re: [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls

2019-01-25 Thread Benjamin Herrenschmidt

On Wed, 2019-01-23 at 21:26 +1100, Paul Mackerras wrote:
> If H_INT_ESB is only used for LSIs, then is a guest going to be using
> it at all?  

*emulated* LSIs, ie LSIs coming from emulated devices. It will depends
in practice of what kind of emulated device you put in your guest. We
need that because under the hood, we send a XIVE MSI, so we need to be
notified of the EOI so we can resend if the emulated LSI is still
asserted. 

> My understanding was that with XIVE, only a small number
> of interrupts that are to do with system management functions are
> LSIs; all of the interrupts relating to PCI-e devices are MSIs.  So do
> we actually have a real high-frequency use case for LSIs in a guest?
> 
> For now I would prefer that you remove all the real-mode hcall
> handlers.  We can add them later if we get performance data showing
> that they are needed.
> 
> Regarding whether or not to have a given hcall handler in the kernel
> at all - if there is for example an hcall which is just called once
> on guest startup, and its function is just to provide information to
> the guest, and QEMU has that information, then why not have that hcall
> implemented by QEMU?  Are any of the hcalls like that?
> 
> For example, if H_INT_GET_SOURCE_INFO was implemented in QEMU, could
> we then remove the VC_BASE thing from the xive device?

Ben.

Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode

2019-01-25 Thread Benjamin Herrenschmidt

On Wed, 2019-01-23 at 20:07 +0100, Cédric Le Goater wrote:
>  Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)
> 
> All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
> Linux should be adjusted ...

All the names changed between the HW design and the "architecture"
document. The HW guys use the old names, the architecture the new
names, and Linux & OPAL mostly use the old ones because frankly the new
names suck big time.

> It would be good to talk a little about the nested support (offline 
> may be) to make sure that we are not missing some major interface that 
> would require a lot of change. If we need to prepare ground, I think
> the timing is good.
> 
> The size of the IRQ number space might be a problem. It seems we 
> would need to increase it considerably to support multiple nested 
> guests. That said I haven't look much how nested is designed.  

The size of the VP space is a bigger concern. Even today. We really
need qemu to tell the max #cpu to KVM so we can allocate less of them.

As for nesting, I suggest for the foreseeable future we stick to XICS
emulation in nested guests.

Cheers,
Ben.

Re: [PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl

2019-01-25 Thread Benjamin Herrenschmidt

On Wed, 2019-01-23 at 19:39 +0100, Cédric Le Goater wrote:
> > The reason I ask is that we will have to be much more careful about
> > memory allocation lifetimes with this patch. 
> 
> yes. bad refcounting will lead the host kernel to a crash. 

One way to alleviate that is to make sure this is only supported on
selected devices such as XICS via some flag or the presence of a
callback.

Cheers,
Ben.

Re: BUG: memcmp(): Accessing invalid memory location

2019-01-24 Thread Benjamin Herrenschmidt

On Thu, 2019-01-24 at 19:48 +0530, Chandan Rajendra wrote:
> - Here we execute "LD rB,0,r4". In the case of this bug, r4 has an unaligned
>   value and hence ends up accessing the "next" double word. The "next" double
>   word happens to occur after the last page mapped into the kernel's address
>   space and hence this leads to the previously listed oops.
>   

This is interesting ... should we mark the last page of any piece of
mapped linear mapping as reserved to avoid that sort of issue ?

Nick ? Aneesh ?

Cheers,
Ben.

Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-23 Thread Benjamin Herrenschmidt

On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
> > Afaik bcs we change the mapping to point to the real HW irq ESB page
> > instead of the "IPI" that was there at VM init time.
> 
> So that makes it sound like there is a whole lot going on that hasn't
> even been hinted at in the patch descriptions...  It sounds like we
> need a good description of how all this works and fits together
> somewhere under Documentation/.
> 
> In any case we need much more informative patch descriptions.  I
> realize that it's all currently in Cedric's head, but I bet that in
> two or three years' time when we come to try to debug something, it
> won't be in anyone's head...

The main problem is understanding XIVE itself. It's not realistic to
ask Cedric to write a proper documentation for XIVE as part of the
patch series, but sadly IBM doesn't have a good one to provide either.

Ben.

Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-22 Thread Benjamin Herrenschmidt

On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
> > Clear the ESB pages from the VMA of the IRQ being pass through to the
> > guest and let the fault handler repopulate the VMA when the ESB pages
> > are accessed for an EOI or for a trigger.
> 
> Why do we want to do this?
> 
> I don't see any possible advantage to removing the PTEs from the
> userspace mapping.  You'll need to explain further.

Afaik bcs we change the mapping to point to the real HW irq ESB page
instead of the "IPI" that was there at VM init time.

Cedric, is that correct ?

Cheers,
Ben.

Re: G5 Quad hangs early on 4.20.2 / 5.0-rc2+

2019-01-17 Thread Benjamin Herrenschmidt

On Thu, 2019-01-17 at 10:42 +0100, Tobias Ulmer wrote:
> On Wed, Jan 16, 2019 at 12:15:14PM +1100, Benjamin Herrenschmidt wrote:
> > On Tue, 2019-01-15 at 23:49 +0100, Tobias Ulmer wrote:
> > > Hi,
> > > 
> > > both the latest stable 4.20.2 and 5.0 rc2+ hang early on the G5 Quad.
> > > 
> > > Surely I'm not the first to run into this, but I couldn't find any
> > > discussion or bug report. Sorry if you're already aware.
> > > 
> > > You can see it hang here (5.0 rc2+, 4.20.2 is nearly identical) until
> > > the watchdog triggers a reboot:
> > > 
> > > https://i.imgur.com/UiCVRuG.jpg
> > > 
> > > If I had to make an uneducated guess, it seems to boot into the same
> > > codepath twice (mpic was already initialized, then it starts again right
> > > after smp bringup). Maybe on a second CPU?
> > > 
> > > To narrow it down a little, my last known good was 4.18.9
> > 
> > I don't think it's an MPIC related problem but it does appear to hang
> > about when interrupts get turned on.
> 
> When they get turned on for the second time, for some reason. You can see the
> end of the first time just on top of the screen.

No, that top of screen init is something else.

> It repeats part of the startup initialization right after it's done with
> smp bringup.

That's just the BootX console hanging over to the main console and
replaying the messages I think.

> > I have one of these critters in the office, but I'm working remotely
> > this week so I won't be able to dig into this until next week.
> > 
> > It might help if you could bisect in the meantime.
> 
> I'm bisecting it now, but it's slow going since I don't have much time
> to babysit the machine. The problem shows up somewhere between v4.19 and
> v4.20.

Ok, thanks.

I'll be back on monday or tuesday, let me know where you got up to then
and I'll take it from there. Also email me your .config please.

Cheers,
Ben.

> > Cheers,
> > Ben.
> > 
> >

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 7203 matches

Mail list logo