RE: [RFC v1 1/3] mm/mmu_notifier: Add a new notifier for mapping updates (new pages)

2023-07-24 Thread Hugh Dickins
On Mon, 24 Jul 2023, Kasireddy, Vivek wrote:
> Hi Jason,
> > On Mon, Jul 24, 2023 at 07:54:38AM +, Kasireddy, Vivek wrote:
> > 
> > > > I'm not at all familiar with the udmabuf use case but that sounds
> > > > brittle and effectively makes this notifier udmabuf specific right?
> > > Oh, Qemu uses the udmabuf driver to provide Host Graphics components
> > > (such as Spice, Gstreamer, UI, etc) zero-copy access to Guest created
> > > buffers. In other words, from a core mm standpoint, udmabuf just
> > > collects a bunch of pages (associated with buffers) scattered inside
> > > the memfd (Guest ram backed by shmem or hugetlbfs) and wraps
> > > them in a dmabuf fd. And, since we provide zero-copy access, we
> > > use DMA fences to ensure that the components on the Host and
> > > Guest do not access the buffer simultaneously.
> > 
> > So why do you need to track updates proactively like this?
> As David noted in the earlier series, if Qemu punches a hole in its memfd
> that goes through pages that are registered against a udmabuf fd, then
> udmabuf needs to update its list with new pages when the hole gets
> filled after (guest) writes. Otherwise, we'd run into the coherency 
> problem (between udmabuf and memfd) as demonstrated in the selftest
> (patch #3 in this series).

Wouldn't this all be very much better if Qemu stopped punching holes there?

Hugh


Re: [PATCH] udmabuf: revert 'Add support for mapping hugepages (v4)'

2023-06-13 Thread Hugh Dickins
On Tue, 13 Jun 2023, David Hildenbrand wrote:
> On 13.06.23 10:26, Kasireddy, Vivek wrote:
> >> On 12.06.23 09:10, Kasireddy, Vivek wrote:
> >>> Sorry for the late reply; I just got back from vacation.
> >>> If it is unsafe to directly use the subpages of a hugetlb page, then
> >>> reverting
> >>> this patch seems like the only option for addressing this issue
> >>> immediately.
> >>> So, this patch is
> >>> Acked-by: Vivek Kasireddy 
> >>>
> >>> As far as the use-case is concerned, there are two main users of the
> >> udmabuf
> >>> driver: Qemu and CrosVM VMMs. However, it appears Qemu is the only
> >> one
> >>> that uses hugetlb pages (when hugetlb=on is set) as the backing store for
> >>> Guest (Linux, Android and Windows) system memory. The main goal is to
> >>> share the pages associated with the Guest allocated framebuffer (FB) with
> >>> the Host GPU driver and other components in a zero-copy way. To that
> >> end,
> >>> the guest GPU driver (virtio-gpu) allocates 4k size pages (associated with
> >>> the FB) and pins them before sharing the (guest) physical (or dma)
> >> addresses
> >>> (and lengths) with Qemu. Qemu then translates the addresses into file
> >>> offsets and shares these offsets with udmabuf.
> >>
> >> Is my understanding correct, that we can effectively long-term pin
> >> (worse than mlock) 64 MiB per UDMABUF_CREATE, allowing eventually !root
> > The 64 MiB limit is the theoretical upper bound that we have not seen hit in
> > practice. Typically, for a 1920x1080 resolution (commonly used in Guests),
> > the size of the FB is ~8 MB (1920x1080x4). And, most modern Graphics
> > compositors flip between two FBs.
> > 
> 
> Okay, but users with privileges to open that file can just create as many as
> they want? I think I'll have to play with it.
> 
> >> users
> >>
> >> ll /dev/udmabuf
> >> crw-rw 1 root kvm 10, 125 12. Jun 08:12 /dev/udmabuf
> >>
> >> to bypass there effective MEMLOCK limit, fragmenting physical memory and
> >> breaking swap?
> > Right, it does not look like the mlock limits are honored.
> > 
> 
> That should be added.

Agreed.

> 
> >>
> >> Regarding the udmabuf_vm_fault(), I assume we're mapping pages we
> >> obtained from the memfd ourselves into a special VMA (mmap() of the
> > mmap operation is really needed only if any component on the Host needs
> > CPU access to the buffer. But in most scenarios, we try to ensure direct GPU
> > access (h/w acceleration via gl) to these pages.
> > 
> >> udmabuf). I'm not sure how well shmem pages are prepared for getting
> >> mapped by someone else into an arbitrary VMA (page->index?).
> > Most drm/gpu drivers use shmem pages as the backing store for FBs and
> > other buffers and also provide mmap capability. What concerns do you see
> > with this approach?
> 
> Are these mmaping the pages the way udmabuf maps these pages (IOW, on-demand
> fault where we core-mm will adjust the mapcount etc)?
> 
> Skimming over at shmem_read_mapping_page() users, I assume most of them use a
> VM_PFNMAP mapping (or don't mmap them at all), where we won't be messing with
> the struct page at all.
> 
> (That might even allow you to mmap hugetlb sub-pages, because the struct page
> -- and mapcount -- will be ignored completely and not touched.)

You're well ahead of me: I didn't reach an understanding of whether or not
mapcount would get manipulated here - though if Junxiao's original patch
did fix the immediate hugetlb symptoms, presumably it is (and without much
point, since udmabuf holds on to that extra reference which pins each
page for the duration).

> 
> > 
> >>
> >> ... also, just imagine someone doing FALLOC_FL_PUNCH_HOLE / ftruncate()
> >> on the memfd. What's mapped into the memfd no longer corresponds to
> >> what's pinned / mapped into the VMA.
> > IIUC, making use of the DMA_BUF_IOCTL_SYNC ioctl would help with any
> > coherency issues:
> > https://www.kernel.org/doc/html/v6.2/driver-api/dma-buf.html#c.dma_buf_sync
> > 
> 
> Would it as of now? udmabuf_create() pulls the shmem pages out of the memfd,
> not sure how DMA_BUF_IOCTL_SYNC would help to update that whenever the pages
> inside the memfd would change (e.g., FALLOC_FL_PUNCH_HOLE + realloc).
> 
> But that's most probably simply "not supported".

Yes, the pages which udmabuf is holding would be the originals: they will
then be detached from the hole-punched file, and subsequent faults or writes
to that backing file (through shmem, rather than through udmabuf) can fill
in the holes with new, different pages.  So long as that's well understood,
then it's not necessarily a disaster.

I see udmabuf asks for SEAL_SHRINK (I guess to keep away from SIGBUS),
but refuses SEAL_WRITE - so hole-punching remains permitted.

> 
> >>
> >>
> >> Was linux-mm (and especially shmem maintainers, ccing Hugh) involved in
> >> the upstreaming of udmabuf?

Thanks for the Cc, David.  No, I wasn't involved at all; but I probably
would not have understood their needs much better then than now.

I don't 

Re: [PATCHv2 2/3] i915: convert to new mount API

2019-08-08 Thread Hugh Dickins
On Thu, 8 Aug 2019, Al Viro wrote:
> On Wed, Aug 07, 2019 at 08:30:02AM +0200, Christoph Hellwig wrote:
> > On Tue, Aug 06, 2019 at 12:50:10AM -0700, Hugh Dickins wrote:
> > > Though personally I'm averse to managing "f"objects through
> > > "m"interfaces, which can get ridiculous (notably, MADV_HUGEPAGE works
> > > on the virtual address of a mapping, but the huge-or-not alignment of
> > > that mapping must have been decided previously).  In Google we do use
> > > fcntls F_HUGEPAGE and F_NOHUGEPAGE to override on a per-file basis -
> > > one day I'll get to upstreaming those.
> > 
> > Such an interface seems very useful, although the two fcntls seem a bit
> > odd.
> > 
> > But I think the point here is that the i915 has its own somewhat odd
> > instance of tmpfs.  If we could pass the equivalent of the huge=*
> > options to shmem_file_setup all that garbage (including the
> > shmem_file_setup_with_mnt function) could go away.
> 
> ... or follow shmem_file_super() with whatever that fcntl maps to
> internally.  I would really love to get rid of that i915 kludge.

As to the immediate problem of i915_gemfs using remount_fs on linux-next,
IIUC, all that is necessary at the moment is the deletions patch below
(but I'd prefer that to come from the i915 folks).  Since gemfs has no
need to change the huge option from its default to its default.

As to the future of when they get back to wanting huge pages in gemfs,
yes, that can probably best be arranged by using the internals of an
fcntl F_HUGEPAGE on those objects that would benefit from it.

Though my intention there was that the "huge=never" default ought
to continue to refuse to give huge pages, even when asked by fcntl.
So a little hackery may still be required, to allow the i915_gemfs
internal mount to get huge pages when a user mount would not.

As to whether shmem_file_setup_with_mnt() needs to live: I've given
that no thought, but accept that shm_mnt is such a ragbag of different
usages, that i915 is right to prefer their own separate gemfs mount.

Hugh

--- mmotm/drivers/gpu/drm/i915/gem/i915_gemfs.c 2019-07-21 19:40:16.573703780 
-0700
+++ linux/drivers/gpu/drm/i915/gem/i915_gemfs.c 2019-08-08 07:19:23.967689058 
-0700
@@ -24,28 +24,6 @@ int i915_gemfs_init(struct drm_i915_priv
if (IS_ERR(gemfs))
return PTR_ERR(gemfs);
 
-   /*
-* Enable huge-pages for objects that are at least HPAGE_PMD_SIZE, most
-* likely 2M. Note that within_size may overallocate huge-pages, if say
-* we allocate an object of size 2M + 4K, we may get 2M + 2M, but under
-* memory pressure shmem should split any huge-pages which can be
-* shrunk.
-*/
-
-   if (has_transparent_hugepage()) {
-   struct super_block *sb = gemfs->mnt_sb;
-   /* FIXME: Disabled until we get W/A for read BW issue. */
-   char options[] = "huge=never";
-   int flags = 0;
-   int err;
-
-   err = sb->s_op->remount_fs(sb, , options);
-   if (err) {
-   kern_unmount(gemfs);
-   return err;
-   }
-   }
-
i915->mm.gemfs = gemfs;
 
return 0;


Re: [PATCHv2 2/3] i915: convert to new mount API

2019-08-06 Thread Hugh Dickins
On Mon, 5 Aug 2019, Al Viro wrote:
> On Mon, Aug 05, 2019 at 07:12:55PM +0100, Al Viro wrote:
> > On Tue, Aug 06, 2019 at 01:03:06AM +0900, Sergey Senozhatsky wrote:
> > > tmpfs does not set ->remount_fs() anymore and its users need
> > > to be converted to new mount API.
> > 
> > Could you explain why the devil do you bother with remount at all?
> > Why not pass the right options when mounting the damn thing?
> 
> ... and while we are at it, I really wonder what's going on with
> that gemfs thing - among the other things, this is the only
> user of shmem_file_setup_with_mnt().  Sure, you want your own
> options, but that brings another question - is there any reason
> for having the huge=... per-superblock rather than per-file?

Yes: we want a default for how files of that superblock are to
allocate their pages, without people having to fcntl or advise
each of their files.

Setting aside the weirder options (within_size, advise) and emergency/
testing override (shmem_huge), we want files on an ordinary default
tmpfs (huge=never) to be allocated with small pages (so users with
access to that filesystem will not consume, and will not waste time
and space on consuming, the more valuable huge pages); but files on a
huge=always tmpfs to be allocated with huge pages whenever possible.

Or am I missing your point?  Yes, hugeness can certainly be decided
differently per-file, or even per-extent of file.  That is already
made possible through "judicious" use of madvise MADV_HUGEPAGE and
MADV_NOHUGEPAGE on mmaps of the file, carried over from anon THP.

Though personally I'm averse to managing "f"objects through
"m"interfaces, which can get ridiculous (notably, MADV_HUGEPAGE works
on the virtual address of a mapping, but the huge-or-not alignment of
that mapping must have been decided previously).  In Google we do use
fcntls F_HUGEPAGE and F_NOHUGEPAGE to override on a per-file basis -
one day I'll get to upstreaming those.

Hugh

> 
> After all, the readers of ->huge in mm/shmem.c are
> mm/shmem.c:582: (shmem_huge == SHMEM_HUGE_FORCE || sbinfo->huge) &&
>   is_huge_enabled(), sbinfo is an explicit argument
> 
> mm/shmem.c:1799:switch (sbinfo->huge) {
>   shmem_getpage_gfp(), sbinfo comes from inode
> 
> mm/shmem.c:2113:if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER)
>   shmem_get_unmapped_area(), sb comes from file
> 
> mm/shmem.c:3531:if (sbinfo->huge)
> mm/shmem.c:3532:seq_printf(seq, ",huge=%s", 
> shmem_format_huge(sbinfo->huge));
>   ->show_options()
> mm/shmem.c:3880:switch (sbinfo->huge) {
>   shmem_huge_enabled(), sbinfo comes from an inode
> 
> And the only caller of is_huge_enabled() is shmem_getattr(), with sbinfo
> picked from inode.
> 
> So is there any reason why the hugepage policy can't be per-file, with
> the current being overridable default?


[PATCH 28/83] mm: Change timing of notification to IOMMUs about a page to be invalidated

2014-07-10 Thread Hugh Dickins
On Fri, 11 Jul 2014, Joerg Roedel wrote:
> On Fri, Jul 11, 2014 at 12:53:26AM +0300, Oded Gabbay wrote:
> >  mm/rmap.c | 8 ++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 196cd0c..73d4c3d 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1231,13 +1231,17 @@ static int try_to_unmap_one(struct page *page, 
> > struct vm_area_struct *vma,
> > } else
> > dec_mm_counter(mm, MM_FILEPAGES);
> >  
> > +   pte_unmap_unlock(pte, ptl);
> > +
> > +   mmu_notifier_invalidate_page(vma, address, event);
> > +
> > page_remove_rmap(page);
> > page_cache_release(page);
> >  
> > +   return ret;
> > +
> >  out_unmap:
> > pte_unmap_unlock(pte, ptl);
> > -   if (ret != SWAP_FAIL && !(flags & TTU_MUNLOCK))
> > -   mmu_notifier_invalidate_page(vma, address, event);
> >  out:
> > return ret;
> 
> I think there is no bug. In that function the page is just unmapped,
> removed from the rmap (page_remove_rmap), and the LRU list
> (page_cache_release). The page itself is not released in this function,
> so the call mmu_notifier_invalidate_page() at the end is fine.

Agreed, nothing to fix here: the try_to_unmap() callers must hold
their own reference to the page.  If they did not, how could they
be sure that this is a page which is appropriate to unmap?

(Nit: we don't actually take a separate reference for the LRU list:
the page_cache_release above corresponds to the reference in the
pte which has just been removed.)

Hugh


Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Hugh Dickins
On Sun, 17 Feb 2013, Daniel Vetter wrote:
> On Sun, Feb 17, 2013 at 3:21 AM, Hugh Dickins  wrote:
> > On Sat, 16 Feb 2013, Hugh Dickins wrote:
> >> On Sat, 16 Feb 2013, Linus Torvalds wrote:
> >> >
> >> > I think it's worth it to give them a heads-up already. So I've cc'd
> >> > the main suspects here..
> >>
> >> Okay, thanks.
> >>
> >> >
> >> > Daniel, Dave - any comments about a NULL fb in
> >> > intel_choose_pipe_bpp_dither() during either suspend or resume? Some
> >> > googling shows this:
> >> >
> >> > https://bugzilla.redhat.com/show_bug.cgi?id=895123
> >>
> >> Great, yes, I'm sure that's the same (though it says "suspend"
> >> and I say "resume").
> >>
> >> >
> >> > which sounds remarkably similar, and is also during a suspend attempt
> >> > (but apparently Satish got a full oops out).. Some timing race with a
> >> > worker entry?
> >
> > Comparing Satish's backtrace and drivers/gpu/drm history, it's clear that
> > the oops comes from Daniel's 3.8-rc2 45e2b5f640b3 "drm/i915: force restore
> > on lid open", whose force_restore case now passes down crtc->base.fb.  But
> > I wouldn't have a clue why that's usually non-NULL but occasionally NULL:
> > your timing race with a worker entry, perhaps.
> >
> > And 45e2b5f640b3 contains a fine history of going back and forth, so I
> > wouldn't want to play further with it out of ignorance - though tempted
> > to replace the "if (force_restore) {" by an interim safe-seeming
> > compromise of "if (force_restore && crtc->base.fb) {".

My suggestion there in the line above was stupidly wrong :(

> 
> Two things to try while I try to grow a clue on what exactly is going on:

Thank you.

By the way, I hope you've looked back through this thread, and realize
that Dave and I both had ThinkPad T4?0s suspend/resume display problems,
but they've gone off in different directions: so a lot of the discussion
comes from Dave having CONFIG_PROVE_RCU_DELAY, and has nothing to do with
what we now know to be my oops in i915/intel_display.c.

> 
> 1. Related to new ACPI sleep states we've recently made the lid_notifier
> locking more sound, maybe that helps:
> 
> http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queued=b8efb17b3d687695b81485f606fc4e6c35a50f9a

Looks like it may be relevant, but I'll ignore it for now:
preferring first to test the more minimal safety you suggest below.

> 
> 2. The new i915 force restore code is indeed missing a safety check
> compared to the old crtc helper based one:
> 
> diff --git a/drivers/gpu/drm/i915/intel_display.c 
> b/drivers/gpu/drm/i915/intel_display.c
> index 6eb3882..095094c 100644
> --- a/drivers/gpu/drm/i915/intel_display.c
> +++ b/drivers/gpu/drm/i915/intel_display.c
> @@ -9153,7 +9153,13 @@ void intel_modeset_setup_hw_state(struct drm_device 
> *dev,
>  
>   if (force_restore) {
>   for_each_pipe(pipe) {
> - 
> intel_crtc_restore_mode(dev_priv->pipe_to_crtc_mapping[pipe]);
> + struct drm_crtc * crtc =
> + dev_priv->pipe_to_crtc_mapping[pipe];
> +
> + if (!crtc->enabled)
> + continue;
> +
> + intel_crtc_restore_mode(crtc);
>   }
>  
>   i915_redisable_vga(dev);

I see your followup, where you observe that intel_modeset_affected_pipes()
should already have made this check; but you do say it would still be good
to prove one way or the other, so I'll run from now with the patch below.

Note that we haven't got to worrying about what will be in 3.9 yet
(linux-next tells me to expect hangcheck timer problems): it's Linus's
current git for 3.8 final that we're working on at present.

And since quick resumes have always worked for me, it's only occasionally
that a long suspend (e.g. overnight) fails for me in this way, so I won't
be able to report success for several days - but failure may come sooner.

And, it being very tiresome to debug when it does fail, I have inserted
WARN_ONs and more safety: here's what I'm running with now.

--- 3.8-rc7/drivers/gpu/drm/i915/intel_display.c2013-01-17 
20:06:11.384002362 -0800
+++ linux/drivers/gpu/drm/i915/intel_display.c  2013-02-17 07:50:28.012075150 
-0800
@@ -4156,7 +4156,9 @@ static bool intel_choose_pipe_bpp_dither
 * also stays within the max display bpc discovered above.
 */

-   switch (fb->depth) {
+   if (WARN_ON(!fb))
+   bpc = 8;
+   else switch (fb

Re: Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Hugh Dickins
On Sat, 16 Feb 2013, Linus Torvalds wrote:
 On Sat, Feb 16, 2013 at 1:45 PM, Hugh Dickins hu...@google.com wrote:
 
  I hacked around on your PM_TRACE set_magic_time() / read_magic_time()
  yesterday, to save an oopsing core kernel ip there, instead of hashed
  pm trace info (it makes sense in this case to invert your sequence,
  putting the high order into years and the low order into minutes).
 
 That sounds like a good idea in general. The PM_TRACE() thing was done
 to figure out things that locked up the PCI bus etc, but encoding the
 oopses during suspend sounds like a really good idea too.

Yes, it can and should be generalized, but needs more than I gave it.

 
 Is your patch clean enough to just be made part of the standard
 PM_TRACE infrastructure, or was it something really hacky and one-off?

Not really hacky, but three reasons why it cannot be standard without
more work (I am supposing that it should not be tied to suspend/resume,
but could be used for any oops which gets hidden by graphic screen,
and also fails to reach /var/log/messages).

1. Because I usually have a version of KDB patched into my kernels
(forked many years ago, never time to integrate with that subset
which eventually went into 2.6.??), it was easiest to implement as
a pair of KDB commands (needing keyboard input: I already knew I
could reboot from KDB in this state).  (Sidenote: using KDB can
prevent getting a trace into /var/log/messages: so I had tried
without it, but still failed to get one.)

2. I don't use initrd, have very little in modules, and a pared down
kernel: so for me it was not a serious restriction to limit to core
kernel addresses, which easily fitted within the limit; but we ought
to put thought into handling module addresses too.

3. Needs care on the interface: a debug config option presumably,
but perhaps also needs to tie in with panic_on_oops or something -
I don't like my date getting messed up by surprise, and there's no
point in messing it up unless there's reason to believe the machine
will very quickly be rebooted.  Since I had to open the lid to resume
to hit the oops, in this case we could rely on me quickly rebooting.

But I've appended what I had below, so it's on the record, and can
be taken further if (likely) someone gets there sooner than I do.

 
  Rewarded last night by reboot to Feb 21 14:45:53 2006.  Which is
  812d60ed intel_choose_pipe_bpp_dither.isra.13+0x216/0x2d6
 
  /home/hugh/3087X/drivers/gpu/drm/i915/intel_display.c:4159
   * enable dithering as needed, but that costs bandwidth.  So choose
   * the minimum value that expresses the full color range of the fb 
  but
   * also stays within the max display bpc discovered above.
   */
 
  switch (fb-depth) {
  812d60e9:   48 8b 55 c0 mov-0x40(%rbp),%rdx
  812d60ed:   8b 02   mov(%rdx),%eax
 
  (gcc chose to pass a pointer to fb-depth down to the function,
  instead of fb itself, since that is the only use of it there.)
 
  I expect that fb is NULL; but with an average of one failure to resume
  per day, and ~26 bits of info per crash, this is not a fast procedure!
 
  I notice that intel_pipe_set_base() allows for NULL fb,
  so I'm currently running with the oops-in-rtc hackery, plus
  -   switch (fb-depth) {
  +   if (WARN_ON(!fb))
  +   bpc = 8;
  +   else switch (fb-depth) {
 
  There's been a fair bit of change to intel_display.c since 3.7 (if
  my 3.7 was indeed good), mainly splitting intel_ into haswell_ versus
  ironlake_, but I've not yet spotted anything obvious; nor yet looked
  to see where fb would originate from anyway.
 
  Once I've got just a little more info out of it, I'll start another
  thread addressed principally to the drm/gpu/i915 guys.
 
 I think it's worth it to give them a heads-up already. So I've cc'd
 the main suspects here..

Okay, thanks.

 
 Daniel, Dave - any comments about a NULL fb in
 intel_choose_pipe_bpp_dither() during either suspend or resume? Some
 googling shows this:
 
 https://bugzilla.redhat.com/show_bug.cgi?id=895123

Great, yes, I'm sure that's the same (though it says suspend
and I say resume).

 
 which sounds remarkably similar, and is also during a suspend attempt
 (but apparently Satish got a full oops out).. Some timing race with a
 worker entry?
 
 Linus

#include linux/rtc.h
#include asm/rtc.h
/*
 * HughD adapted the following from drivers/base/power/trace.c:
 *
 * Copyright (C) 2006 Linus Torvalds
 *
 * Trace facility for suspend/resume problems, when none of the
 * devices may be working.
 *
 * Horrid, horrid, horrid.
 *
 * It turns out that the _only_ piece of hardware that actually
 * keeps its value across a hard boot (and, more importantly, the
 * POST init sequence) is literally the realtime clock.
 *
 * Never mind that an RTC chip has 114 bytes (and often a whole
 * other bank of an additional 128 bytes

Re: Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Hugh Dickins
On Sat, 16 Feb 2013, Hugh Dickins wrote:
 On Sat, 16 Feb 2013, Linus Torvalds wrote:
  
  I think it's worth it to give them a heads-up already. So I've cc'd
  the main suspects here..
 
 Okay, thanks.
 
  
  Daniel, Dave - any comments about a NULL fb in
  intel_choose_pipe_bpp_dither() during either suspend or resume? Some
  googling shows this:
  
  https://bugzilla.redhat.com/show_bug.cgi?id=895123
 
 Great, yes, I'm sure that's the same (though it says suspend
 and I say resume).
 
  
  which sounds remarkably similar, and is also during a suspend attempt
  (but apparently Satish got a full oops out).. Some timing race with a
  worker entry?

Comparing Satish's backtrace and drivers/gpu/drm history, it's clear that
the oops comes from Daniel's 3.8-rc2 45e2b5f640b3 drm/i915: force restore
on lid open, whose force_restore case now passes down crtc-base.fb.  But
I wouldn't have a clue why that's usually non-NULL but occasionally NULL:
your timing race with a worker entry, perhaps.

And 45e2b5f640b3 contains a fine history of going back and forth, so I
wouldn't want to play further with it out of ignorance - though tempted
to replace the if (force_restore) { by an interim safe-seeming
compromise of if (force_restore  crtc-base.fb) {.

Hugh
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: Debugging Thinkpad T430s occasional suspend failure.

2013-02-17 Thread Hugh Dickins
On Sun, 17 Feb 2013, Daniel Vetter wrote:
 On Sun, Feb 17, 2013 at 3:21 AM, Hugh Dickins hu...@google.com wrote:
  On Sat, 16 Feb 2013, Hugh Dickins wrote:
  On Sat, 16 Feb 2013, Linus Torvalds wrote:
  
   I think it's worth it to give them a heads-up already. So I've cc'd
   the main suspects here..
 
  Okay, thanks.
 
  
   Daniel, Dave - any comments about a NULL fb in
   intel_choose_pipe_bpp_dither() during either suspend or resume? Some
   googling shows this:
  
   https://bugzilla.redhat.com/show_bug.cgi?id=895123
 
  Great, yes, I'm sure that's the same (though it says suspend
  and I say resume).
 
  
   which sounds remarkably similar, and is also during a suspend attempt
   (but apparently Satish got a full oops out).. Some timing race with a
   worker entry?
 
  Comparing Satish's backtrace and drivers/gpu/drm history, it's clear that
  the oops comes from Daniel's 3.8-rc2 45e2b5f640b3 drm/i915: force restore
  on lid open, whose force_restore case now passes down crtc-base.fb.  But
  I wouldn't have a clue why that's usually non-NULL but occasionally NULL:
  your timing race with a worker entry, perhaps.
 
  And 45e2b5f640b3 contains a fine history of going back and forth, so I
  wouldn't want to play further with it out of ignorance - though tempted
  to replace the if (force_restore) { by an interim safe-seeming
  compromise of if (force_restore  crtc-base.fb) {.

My suggestion there in the line above was stupidly wrong :(

 
 Two things to try while I try to grow a clue on what exactly is going on:

Thank you.

By the way, I hope you've looked back through this thread, and realize
that Dave and I both had ThinkPad T4?0s suspend/resume display problems,
but they've gone off in different directions: so a lot of the discussion
comes from Dave having CONFIG_PROVE_RCU_DELAY, and has nothing to do with
what we now know to be my oops in i915/intel_display.c.

 
 1. Related to new ACPI sleep states we've recently made the lid_notifier
 locking more sound, maybe that helps:
 
 http://cgit.freedesktop.org/~danvet/drm-intel/commit/?h=drm-intel-next-queuedid=b8efb17b3d687695b81485f606fc4e6c35a50f9a

Looks like it may be relevant, but I'll ignore it for now:
preferring first to test the more minimal safety you suggest below.

 
 2. The new i915 force restore code is indeed missing a safety check
 compared to the old crtc helper based one:
 
 diff --git a/drivers/gpu/drm/i915/intel_display.c 
 b/drivers/gpu/drm/i915/intel_display.c
 index 6eb3882..095094c 100644
 --- a/drivers/gpu/drm/i915/intel_display.c
 +++ b/drivers/gpu/drm/i915/intel_display.c
 @@ -9153,7 +9153,13 @@ void intel_modeset_setup_hw_state(struct drm_device 
 *dev,
  
   if (force_restore) {
   for_each_pipe(pipe) {
 - 
 intel_crtc_restore_mode(dev_priv-pipe_to_crtc_mapping[pipe]);
 + struct drm_crtc * crtc =
 + dev_priv-pipe_to_crtc_mapping[pipe];
 +
 + if (!crtc-enabled)
 + continue;
 +
 + intel_crtc_restore_mode(crtc);
   }
  
   i915_redisable_vga(dev);

I see your followup, where you observe that intel_modeset_affected_pipes()
should already have made this check; but you do say it would still be good
to prove one way or the other, so I'll run from now with the patch below.

Note that we haven't got to worrying about what will be in 3.9 yet
(linux-next tells me to expect hangcheck timer problems): it's Linus's
current git for 3.8 final that we're working on at present.

And since quick resumes have always worked for me, it's only occasionally
that a long suspend (e.g. overnight) fails for me in this way, so I won't
be able to report success for several days - but failure may come sooner.

And, it being very tiresome to debug when it does fail, I have inserted
WARN_ONs and more safety: here's what I'm running with now.

--- 3.8-rc7/drivers/gpu/drm/i915/intel_display.c2013-01-17 
20:06:11.384002362 -0800
+++ linux/drivers/gpu/drm/i915/intel_display.c  2013-02-17 07:50:28.012075150 
-0800
@@ -4156,7 +4156,9 @@ static bool intel_choose_pipe_bpp_dither
 * also stays within the max display bpc discovered above.
 */
 
-   switch (fb-depth) {
+   if (WARN_ON(!fb))
+   bpc = 8;
+   else switch (fb-depth) {
case 8:
bpc = 8; /* since we go through a colormap */
break;
@@ -9302,6 +9304,10 @@ void intel_modeset_setup_hw_state(struct
if (force_restore) {
for_each_pipe(pipe) {
crtc = 
to_intel_crtc(dev_priv-pipe_to_crtc_mapping[pipe]);
+   if (WARN_ON(!crtc-base.enabled))
+   continue;
+   if (WARN_ON(!crtc-base.fb))
+   continue;
intel_set_mode(crtc-base, crtc-base.mode,
   crtc

Debugging Thinkpad T430s occasional suspend failure.

2013-02-16 Thread Hugh Dickins
On Sat, 16 Feb 2013, Hugh Dickins wrote:
> On Sat, 16 Feb 2013, Linus Torvalds wrote:
> > 
> > I think it's worth it to give them a heads-up already. So I've cc'd
> > the main suspects here..
> 
> Okay, thanks.
> 
> > 
> > Daniel, Dave - any comments about a NULL fb in
> > intel_choose_pipe_bpp_dither() during either suspend or resume? Some
> > googling shows this:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=895123
> 
> Great, yes, I'm sure that's the same (though it says "suspend"
> and I say "resume").
> 
> > 
> > which sounds remarkably similar, and is also during a suspend attempt
> > (but apparently Satish got a full oops out).. Some timing race with a
> > worker entry?

Comparing Satish's backtrace and drivers/gpu/drm history, it's clear that
the oops comes from Daniel's 3.8-rc2 45e2b5f640b3 "drm/i915: force restore
on lid open", whose force_restore case now passes down crtc->base.fb.  But
I wouldn't have a clue why that's usually non-NULL but occasionally NULL:
your timing race with a worker entry, perhaps.

And 45e2b5f640b3 contains a fine history of going back and forth, so I
wouldn't want to play further with it out of ignorance - though tempted
to replace the "if (force_restore) {" by an interim safe-seeming
compromise of "if (force_restore && crtc->base.fb) {".

Hugh


Debugging Thinkpad T430s occasional suspend failure.

2013-02-16 Thread Hugh Dickins
On Sat, 16 Feb 2013, Linus Torvalds wrote:
> On Sat, Feb 16, 2013 at 1:45 PM, Hugh Dickins  wrote:
> >
> > I hacked around on your PM_TRACE set_magic_time() / read_magic_time()
> > yesterday, to save an oopsing core kernel ip there, instead of hashed
> > pm trace info (it makes sense in this case to invert your sequence,
> > putting the high order into years and the low order into minutes).
> 
> That sounds like a good idea in general. The PM_TRACE() thing was done
> to figure out things that locked up the PCI bus etc, but encoding the
> oopses during suspend sounds like a really good idea too.

Yes, it can and should be generalized, but needs more than I gave it.

> 
> Is your patch clean enough to just be made part of the standard
> PM_TRACE infrastructure, or was it something really hacky and one-off?

Not really hacky, but three reasons why it cannot be standard without
more work (I am supposing that it should not be tied to suspend/resume,
but could be used for any oops which gets hidden by graphic screen,
and also fails to reach /var/log/messages).

1. Because I usually have a version of KDB patched into my kernels
("forked" many years ago, never time to integrate with that subset
which eventually went into 2.6.??), it was easiest to implement as
a pair of KDB commands (needing keyboard input: I already knew I
could "reboot" from KDB in this state).  (Sidenote: using KDB can
prevent getting a trace into /var/log/messages: so I had tried
without it, but still failed to get one.)

2. I don't use initrd, have very little in modules, and a pared down
kernel: so for me it was not a serious restriction to limit to core
kernel addresses, which easily fitted within the limit; but we ought
to put thought into handling module addresses too.

3. Needs care on the interface: a debug config option presumably,
but perhaps also needs to tie in with panic_on_oops or something -
I don't like my date getting messed up by surprise, and there's no
point in messing it up unless there's reason to believe the machine
will very quickly be rebooted.  Since I had to open the lid to resume
to hit the oops, in this case we could rely on me quickly rebooting.

But I've appended what I had below, so it's on the record, and can
be taken further if (likely) someone gets there sooner than I do.

> 
> > Rewarded last night by reboot to Feb 21 14:45:53 2006.  Which is
> > 812d60ed intel_choose_pipe_bpp_dither.isra.13+0x216/0x2d6
> >
> > /home/hugh/3087X/drivers/gpu/drm/i915/intel_display.c:4159
> >  * enable dithering as needed, but that costs bandwidth.  So choose
> >  * the minimum value that expresses the full color range of the fb 
> > but
> >  * also stays within the max display bpc discovered above.
> >  */
> >
> > switch (fb->depth) {
> > 812d60e9:   48 8b 55 c0 mov-0x40(%rbp),%rdx
> > 812d60ed:   8b 02   mov(%rdx),%eax
> >
> > (gcc chose to pass a pointer to fb->depth down to the function,
> > instead of fb itself, since that is the only use of it there.)
> >
> > I expect that fb is NULL; but with an average of one failure to resume
> > per day, and ~26 bits of info per crash, this is not a fast procedure!
> >
> > I notice that intel_pipe_set_base() allows for NULL fb,
> > so I'm currently running with the oops-in-rtc hackery, plus
> > -   switch (fb->depth) {
> > +   if (WARN_ON(!fb))
> > +   bpc = 8;
> > +   else switch (fb->depth) {
> >
> > There's been a fair bit of change to intel_display.c since 3.7 (if
> > my 3.7 was indeed good), mainly splitting intel_ into haswell_ versus
> > ironlake_, but I've not yet spotted anything obvious; nor yet looked
> > to see where fb would originate from anyway.
> >
> > Once I've got just a little more info out of it, I'll start another
> > thread addressed principally to the drm/gpu/i915 guys.
> 
> I think it's worth it to give them a heads-up already. So I've cc'd
> the main suspects here..

Okay, thanks.

> 
> Daniel, Dave - any comments about a NULL fb in
> intel_choose_pipe_bpp_dither() during either suspend or resume? Some
> googling shows this:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=895123

Great, yes, I'm sure that's the same (though it says "suspend"
and I say "resume").

> 
> which sounds remarkably similar, and is also during a suspend attempt
> (but apparently Satish got a full oops out).. Some timing race with a
> worker entry?
> 
> Linus

#include 
#include 
/*
 * HughD adapted the following from drivers/base/power/trace.c:
 *
 * Copyright (C) 2006 Lin

Re: inux-next: Tree for Sept 26 (not bootable on AMD64: thermal|acpi|drm/i915|pci related?)

2012-09-27 Thread Hugh Dickins
On Wed, 26 Sep 2012, Sedat Dilek wrote:
 
 on my Ubuntu/precise AMD64 today's Linux-Next runs into the following
 call-trace (machine freezes):
 
  Sep 26 19:22:58 fambox kernel: [   11.124739] BUG: unable to handle
 kernel NULL pointer dereference at 0018
 Sep 26 19:22:58 fambox kernel: [   11.124806] IP: [814bb058]
 thermal_cooling_device_register+0x2c8/0x3d0
 Sep 26 19:22:58 fambox kernel: [   11.124869] PGD 0
 Sep 26 19:22:58 fambox kernel: [   11.124895] Oops:  [#1] SMP
 Sep 26 19:22:58 fambox kernel: [   11.124919] Modules linked in:
 coretemp kvm_intel kvm snd_hda_intel(+) arc4 snd_hda_codec iwldvm
 snd_hwdep ghash_clmulni_intel snd_pcm aesni_intel uvcvideo mac80211
 aes_x86_64 snd_page_alloc ablk_helper i915(+) snd_seq_midi cryptd
 videobuf2_vmalloc snd_seq_midi_event xts videobuf2_memops lrw
 snd_rawmidi videobuf2_core joydev gf128mul videodev snd_seq
 snd_seq_device hid_generic snd_timer drm_kms_helper iwlwifi drm snd
 psmouse i2c_algo_bit soundcore btusb microcode serio_raw
 samsung_laptop wmi cfg80211 bluetooth mei mac_hid video lpc_ich lp
 parport ext4 jbd2 usbhid hid r8169
 Sep 26 19:22:58 fambox kernel: [   11.125319] CPU 2
 Sep 26 19:22:58 fambox kernel: [   11.125332] Pid: 579, comm: modprobe
 Not tainted 3.6.0-rc7-next20120926-2-iniza-generic #1 SAMSUNG
 ELECTRONICS CO., LTD. 530U3BI/530U4BI/530U4BH/530U3BI/530U4BI/530U4BH
 Sep 26 19:22:58 fambox kernel: [   11.125401] RIP:
 0010:[814bb058]  [814bb058]
 thermal_cooling_device_register+0x2c8/0x3d0
 Sep 26 19:22:58 fambox kernel: [   11.125450] RSP:
 0018:88010f23d838  EFLAGS: 00010246
 Sep 26 19:22:58 fambox kernel: [   11.125475] RAX: 
 RBX: 88010bfd4c00 RCX: 0001
 Sep 26 19:22:58 fambox kernel: [   11.125507] RDX: 
 RSI: 0282 RDI: 0282
 Sep 26 19:22:58 fambox kernel: [   11.125539] RBP: 88010f23d878
 R08:  R09: 0001
 Sep 26 19:22:58 fambox kernel: [   11.125570] R10: 8801054fc460
 R11:  R12: 88010bfd4c04
 Sep 26 19:22:58 fambox kernel: [   11.125602] R13: 88011a73e000
 R14:  R15: 
 Sep 26 19:22:58 fambox kernel: [   11.125634] FS:
 7f6b8da29700() GS:88011fa8()
 knlGS:
 Sep 26 19:22:58 fambox kernel: [   11.125670] CS:  0010 DS:  ES:
  CR0: 80050033
 Sep 26 19:22:58 fambox kernel: [   11.125697] CR2: 0018
 CR3: 000110185000 CR4: 000407e0
 Sep 26 19:22:58 fambox kernel: [   11.125729] DR0: 
 DR1:  DR2: 
 Sep 26 19:22:58 fambox kernel: [   11.125761] DR3: 
 DR6: 0ff0 DR7: 0400
 Sep 26 19:22:58 fambox kernel: [   11.125793] Process modprobe (pid:
 579, threadinfo 88010f23c000, task 880116648000)
 Sep 26 19:22:58 fambox kernel: [   11.127133] Stack:
 Sep 26 19:22:58 fambox kernel: [   11.128461]  88011a5d9098
 88010d74c300 88010f23d878 880110b06480
 Sep 26 19:22:58 fambox kernel: [   11.129755]  88010bfd1000
 88011a5d9098 88010d74c300 88011a625000
 Sep 26 19:22:58 fambox kernel: [   11.130943]  88010f23d958
 a00e5868  811fa033
 Sep 26 19:22:58 fambox kernel: [   11.132749] Call Trace:
 Sep 26 19:22:58 fambox kernel: [   11.133939]  [a00e5868]
 acpi_video_bus_add+0x9ba/0xce6 [video]
 Sep 26 19:22:58 fambox kernel: [   11.135130]  [811fa033] ?
 sysfs_addrm_finish+0x33/0xc0
 Sep 26 19:22:58 fambox kernel: [   11.136313]  [813454cc]
 acpi_device_probe+0x4e/0x11c
 Sep 26 19:22:58 fambox kernel: [   11.137482]  [813d472b]
 driver_probe_device+0x7b/0x240
 Sep 26 19:22:58 fambox kernel: [   11.138642]  [813d499b]
 __driver_attach+0xab/0xb0
 Sep 26 19:22:58 fambox kernel: [   11.139798]  [813d48f0] ?
 driver_probe_device+0x240/0x240
 Sep 26 19:22:58 fambox kernel: [   11.140981]  [813d2b46]
 bus_for_each_dev+0x56/0x90
 Sep 26 19:22:58 fambox kernel: [   11.142129]  [813d425e]
 driver_attach+0x1e/0x20
 Sep 26 19:22:58 fambox kernel: [   11.143264]  [813d3dd0]
 bus_add_driver+0x190/0x290
 Sep 26 19:22:58 fambox kernel: [   11.13]  [813d4efa]
 driver_register+0x7a/0x160
 Sep 26 19:22:58 fambox kernel: [   11.145583]  [81345ccf]
 acpi_bus_register_driver+0x43/0x45
 Sep 26 19:22:58 fambox kernel: [   11.146871]  [a00e4dac]
 acpi_video_register+0x20/0x39 [video]
 Sep 26 19:22:58 fambox kernel: [   11.148167]  [a02f4bad]
 i915_driver_load+0x83d/0xea0 [i915]
 Sep 26 19:22:58 fambox kernel: [   11.149451]  [a020ebc1]
 drm_get_pci_dev+0x191/0x2b0 [drm]
 Sep 26 19:22:58 fambox kernel: [   11.150739]  [a0345e2b]
 i915_pci_probe+0x4f/0x57 [i915]
 Sep 26 19:22:58 fambox kernel: [   11.152015]  [81309af9]
 local_pci_probe+0x79/0x100
 Sep 26 19:22:58 fambox kernel: [   11.153287]  

inux-next: Tree for Sept 26 (not bootable on AMD64: thermal|acpi|drm/i915|pci related?)

2012-09-26 Thread Hugh Dickins
On Wed, 26 Sep 2012, Sedat Dilek wrote:
> 
> on my Ubuntu/precise AMD64 today's Linux-Next runs into the following
> call-trace (machine freezes):
> 
>  Sep 26 19:22:58 fambox kernel: [   11.124739] BUG: unable to handle
> kernel NULL pointer dereference at 0018
> Sep 26 19:22:58 fambox kernel: [   11.124806] IP: []
> thermal_cooling_device_register+0x2c8/0x3d0
> Sep 26 19:22:58 fambox kernel: [   11.124869] PGD 0
> Sep 26 19:22:58 fambox kernel: [   11.124895] Oops:  [#1] SMP
> Sep 26 19:22:58 fambox kernel: [   11.124919] Modules linked in:
> coretemp kvm_intel kvm snd_hda_intel(+) arc4 snd_hda_codec iwldvm
> snd_hwdep ghash_clmulni_intel snd_pcm aesni_intel uvcvideo mac80211
> aes_x86_64 snd_page_alloc ablk_helper i915(+) snd_seq_midi cryptd
> videobuf2_vmalloc snd_seq_midi_event xts videobuf2_memops lrw
> snd_rawmidi videobuf2_core joydev gf128mul videodev snd_seq
> snd_seq_device hid_generic snd_timer drm_kms_helper iwlwifi drm snd
> psmouse i2c_algo_bit soundcore btusb microcode serio_raw
> samsung_laptop wmi cfg80211 bluetooth mei mac_hid video lpc_ich lp
> parport ext4 jbd2 usbhid hid r8169
> Sep 26 19:22:58 fambox kernel: [   11.125319] CPU 2
> Sep 26 19:22:58 fambox kernel: [   11.125332] Pid: 579, comm: modprobe
> Not tainted 3.6.0-rc7-next20120926-2-iniza-generic #1 SAMSUNG
> ELECTRONICS CO., LTD. 530U3BI/530U4BI/530U4BH/530U3BI/530U4BI/530U4BH
> Sep 26 19:22:58 fambox kernel: [   11.125401] RIP:
> 0010:[]  []
> thermal_cooling_device_register+0x2c8/0x3d0
> Sep 26 19:22:58 fambox kernel: [   11.125450] RSP:
> 0018:88010f23d838  EFLAGS: 00010246
> Sep 26 19:22:58 fambox kernel: [   11.125475] RAX: 
> RBX: 88010bfd4c00 RCX: 0001
> Sep 26 19:22:58 fambox kernel: [   11.125507] RDX: 
> RSI: 0282 RDI: 0282
> Sep 26 19:22:58 fambox kernel: [   11.125539] RBP: 88010f23d878
> R08:  R09: 0001
> Sep 26 19:22:58 fambox kernel: [   11.125570] R10: 8801054fc460
> R11:  R12: 88010bfd4c04
> Sep 26 19:22:58 fambox kernel: [   11.125602] R13: 88011a73e000
> R14:  R15: 
> Sep 26 19:22:58 fambox kernel: [   11.125634] FS:
> 7f6b8da29700() GS:88011fa8()
> knlGS:
> Sep 26 19:22:58 fambox kernel: [   11.125670] CS:  0010 DS:  ES:
>  CR0: 80050033
> Sep 26 19:22:58 fambox kernel: [   11.125697] CR2: 0018
> CR3: 000110185000 CR4: 000407e0
> Sep 26 19:22:58 fambox kernel: [   11.125729] DR0: 
> DR1:  DR2: 
> Sep 26 19:22:58 fambox kernel: [   11.125761] DR3: 
> DR6: 0ff0 DR7: 0400
> Sep 26 19:22:58 fambox kernel: [   11.125793] Process modprobe (pid:
> 579, threadinfo 88010f23c000, task 880116648000)
> Sep 26 19:22:58 fambox kernel: [   11.127133] Stack:
> Sep 26 19:22:58 fambox kernel: [   11.128461]  88011a5d9098
> 88010d74c300 88010f23d878 880110b06480
> Sep 26 19:22:58 fambox kernel: [   11.129755]  88010bfd1000
> 88011a5d9098 88010d74c300 88011a625000
> Sep 26 19:22:58 fambox kernel: [   11.130943]  88010f23d958
> a00e5868  811fa033
> Sep 26 19:22:58 fambox kernel: [   11.132749] Call Trace:
> Sep 26 19:22:58 fambox kernel: [   11.133939]  []
> acpi_video_bus_add+0x9ba/0xce6 [video]
> Sep 26 19:22:58 fambox kernel: [   11.135130]  [] ?
> sysfs_addrm_finish+0x33/0xc0
> Sep 26 19:22:58 fambox kernel: [   11.136313]  []
> acpi_device_probe+0x4e/0x11c
> Sep 26 19:22:58 fambox kernel: [   11.137482]  []
> driver_probe_device+0x7b/0x240
> Sep 26 19:22:58 fambox kernel: [   11.138642]  []
> __driver_attach+0xab/0xb0
> Sep 26 19:22:58 fambox kernel: [   11.139798]  [] ?
> driver_probe_device+0x240/0x240
> Sep 26 19:22:58 fambox kernel: [   11.140981]  []
> bus_for_each_dev+0x56/0x90
> Sep 26 19:22:58 fambox kernel: [   11.142129]  []
> driver_attach+0x1e/0x20
> Sep 26 19:22:58 fambox kernel: [   11.143264]  []
> bus_add_driver+0x190/0x290
> Sep 26 19:22:58 fambox kernel: [   11.13]  []
> driver_register+0x7a/0x160
> Sep 26 19:22:58 fambox kernel: [   11.145583]  []
> acpi_bus_register_driver+0x43/0x45
> Sep 26 19:22:58 fambox kernel: [   11.146871]  []
> acpi_video_register+0x20/0x39 [video]
> Sep 26 19:22:58 fambox kernel: [   11.148167]  []
> i915_driver_load+0x83d/0xea0 [i915]
> Sep 26 19:22:58 fambox kernel: [   11.149451]  []
> drm_get_pci_dev+0x191/0x2b0 [drm]
> Sep 26 19:22:58 fambox kernel: [   11.150739]  []
> i915_pci_probe+0x4f/0x57 [i915]
> Sep 26 19:22:58 fambox kernel: [   11.152015]  []
> local_pci_probe+0x79/0x100
> Sep 26 19:22:58 fambox kernel: [   11.153287]  []
> pci_device_probe+0x109/0x130
> Sep 26 19:22:58 fambox kernel: [   11.154546]  []
> driver_probe_device+0x7b/0x240
> Sep 26 19:22:58 fambox kernel: [   11.155796]  []
> __driver_attach+0xab/0xb0
> Sep 26 19:22:58 fambox 

Re: [PATCH] mm: Work around Intel SNB GTT bug with some physical pages.

2012-05-09 Thread Hugh Dickins
On Mon, 7 May 2012, Stephane Marchesin wrote:

 While investing some Sandy Bridge rendering corruption, I found out
 that all physical memory pages below 1MiB were returning garbage when
 read through the GTT. This has been causing graphics corruption (when
 it's used for textures, render targets and pixmaps) and GPU hangups
 (when it's used for GPU batch buffers).
 
 I talked with some people at Intel and they confirmed my findings,
 and said that a couple of other random pages were also affected.
 
 We could fix this problem by adding an e820 region preventing the
 memory below 1 MiB to be used, but that prevents at least my machine
 from booting. One could think that we should be able to fix it in
 i915, but since the allocation is done by the backing shmem this is
 not possible.
 
 In the end, I came up with the ugly workaround of just leaking the
 offending pages in shmem.c. I do realize it's truly ugly, but I'm
 looking for a fix to the existing code, and am wondering if people on
 this list have a better idea, short of rewriting i915_gem.c to
 allocate its own pages directly.
 
 Signed-off-by: Stephane Marchesin marc...@chromium.org

Well done for discovering and pursuing this issue, but of course (as
you know: you're trying to provoke us to better) your patch is revolting.

And not even enough: swapin readahead and swapoff can read back
from swap into pages which the i915 will later turn out to dislike.

I do have a shmem.c patch coming up for gma500, which cannot use pages
over 4GB; but that fits more reasonably with memory allocation policies,
where we expect that anyone who can use a high page can use a lower as
well, and there's already __GFP_DMA32 to set the limit.

Your limitation is at the opposite end, so that patch won't help you at
all.  And I don't see how Andi's ZONE_DMA exclusion would work, without
equal hackery to enable private zonelists, avoiding that convention.

i915 is not the only user of shmem, and x86 not the only architecture:
we're not going to make everyone suffer for this.  Once the memory
allocator gets down to giving you the low 1MB, my guess is that it's
already short of memory, and liable to deadlock or OOM if you refuse
and soak up every page it then gives you.  Even if i915 has to live
with that possibility, we're not going to extend it to everyone else.

arch/x86/Kconfig has X86_RESERVE_LOW, default 64, range 4 640 (and
I think we reserve all the memory range from 640kB to 1MB anyway).
Would setting that to 640 allow you to boot, and avoid the i915
problem on all but the odd five pages?  I'm not pretending that's
an ideal solution at all (unless freeing initmem could release most
of it on non-SandyBridge and non-i915 machines), but it would be
useful to know if that does provide a stopgap solution.  If that
does work, maybe we just mark the odd five PageReserved at startup.

Is there really no way this can be handled closer to the source of
the problem, in the i915 driver itself?  I do not know the flow of
control in i915 (and X) at all, but on the surface it seems that the
problem only comes when you map these problematic pages into the GTT
(if I'm even using the right terminology), and something (not shmem.c)
actively has to do that.

Can't you check the pfn at that point, and if it's an unsuitable page,
copy into a suitable page (perhaps allocated then, perhaps from a pool
you primed earlier) and map that suitable page into the GTT instead?
Maybe using page-private to link them if that helps.

So long as the page (or its shadow) is mapped into the GTT, I imagine
it would be pinned, and not liable to be swapped out or otherwise
interfered with by shmem.c.  And when you unmap it from GTT, copy
back to the unsuitable shmem object page before unpinning.

I fully accept that I have very little understanding of GPU DRM GTT
and i915, and this may be impossible or incoherent: but please, let's
try to keep the strangeness where it belongs.  If necessary, we'll
have add some kind of flag and callback from shmem.c to the driver;
but I'd so much prefer to avoid that.

Hugh

 
 Change-Id: I957e125fb280e0b0d6b05a83cc4068df2f05aa0a
 ---
  mm/shmem.c |   39 +--
  1 files changed, 37 insertions(+), 2 deletions(-)
 
 diff --git a/mm/shmem.c b/mm/shmem.c
 index 6c253f7..dcbb58b 100644
 --- a/mm/shmem.c
 +++ b/mm/shmem.c
 @@ -768,6 +768,31 @@ redirty:
   return 0;
  }
  
 +/*
 + * Some intel GPUs can't use those pages in the GTT, which results in
 + * graphics corruption. Sadly, it's impossible to prevent usage of those
 + * pages in the intel allocator.
 + *
 + * Instead, we test for those areas here and leak the corresponding pages.
 + *
 + * Some day, when the intel GPU memory is not backed by shmem any more,
 + * we'll be able to come up with a solution which is contained in i915.
 + */
 +static bool i915_usable_page(struct page *page)
 +{
 + dma_addr_t addr = page_to_phys(page);
 +
 + if (unlikely((addr  1 * 1024 * 1024) ||
 +   

[PATCH] mm: Work around Intel SNB GTT bug with some physical pages.

2012-05-08 Thread Hugh Dickins
On Mon, 7 May 2012, Stephane Marchesin wrote:

> While investing some Sandy Bridge rendering corruption, I found out
> that all physical memory pages below 1MiB were returning garbage when
> read through the GTT. This has been causing graphics corruption (when
> it's used for textures, render targets and pixmaps) and GPU hangups
> (when it's used for GPU batch buffers).
> 
> I talked with some people at Intel and they confirmed my findings,
> and said that a couple of other random pages were also affected.
> 
> We could fix this problem by adding an e820 region preventing the
> memory below 1 MiB to be used, but that prevents at least my machine
> from booting. One could think that we should be able to fix it in
> i915, but since the allocation is done by the backing shmem this is
> not possible.
> 
> In the end, I came up with the ugly workaround of just leaking the
> offending pages in shmem.c. I do realize it's truly ugly, but I'm
> looking for a fix to the existing code, and am wondering if people on
> this list have a better idea, short of rewriting i915_gem.c to
> allocate its own pages directly.
> 
> Signed-off-by: Stephane Marchesin 

Well done for discovering and pursuing this issue, but of course (as
you know: you're trying to provoke us to better) your patch is revolting.

And not even enough: swapin readahead and swapoff can read back
from swap into pages which the i915 will later turn out to dislike.

I do have a shmem.c patch coming up for gma500, which cannot use pages
over 4GB; but that fits more reasonably with memory allocation policies,
where we expect that anyone who can use a high page can use a lower as
well, and there's already __GFP_DMA32 to set the limit.

Your limitation is at the opposite end, so that patch won't help you at
all.  And I don't see how Andi's ZONE_DMA exclusion would work, without
equal hackery to enable private zonelists, avoiding that convention.

i915 is not the only user of shmem, and x86 not the only architecture:
we're not going to make everyone suffer for this.  Once the memory
allocator gets down to giving you the low 1MB, my guess is that it's
already short of memory, and liable to deadlock or OOM if you refuse
and soak up every page it then gives you.  Even if i915 has to live
with that possibility, we're not going to extend it to everyone else.

arch/x86/Kconfig has X86_RESERVE_LOW, default 64, range 4 640 (and
I think we reserve all the memory range from 640kB to 1MB anyway).
Would setting that to 640 allow you to boot, and avoid the i915
problem on all but the odd five pages?  I'm not pretending that's
an ideal solution at all (unless freeing initmem could release most
of it on non-SandyBridge and non-i915 machines), but it would be
useful to know if that does provide a stopgap solution.  If that
does work, maybe we just mark the odd five PageReserved at startup.

Is there really no way this can be handled closer to the source of
the problem, in the i915 driver itself?  I do not know the flow of
control in i915 (and X) at all, but on the surface it seems that the
problem only comes when you map these problematic pages into the GTT
(if I'm even using the right terminology), and something (not shmem.c)
actively has to do that.

Can't you check the pfn at that point, and if it's an unsuitable page,
copy into a suitable page (perhaps allocated then, perhaps from a pool
you primed earlier) and map that suitable page into the GTT instead?
Maybe using page->private to link them if that helps.

So long as the page (or its shadow) is mapped into the GTT, I imagine
it would be pinned, and not liable to be swapped out or otherwise
interfered with by shmem.c.  And when you unmap it from GTT, copy
back to the unsuitable shmem object page before unpinning.

I fully accept that I have very little understanding of GPU DRM GTT
and i915, and this may be impossible or incoherent: but please, let's
try to keep the strangeness where it belongs.  If necessary, we'll
have add some kind of flag and callback from shmem.c to the driver;
but I'd so much prefer to avoid that.

Hugh

> 
> Change-Id: I957e125fb280e0b0d6b05a83cc4068df2f05aa0a
> ---
>  mm/shmem.c |   39 +--
>  1 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 6c253f7..dcbb58b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -768,6 +768,31 @@ redirty:
>   return 0;
>  }
>  
> +/*
> + * Some intel GPUs can't use those pages in the GTT, which results in
> + * graphics corruption. Sadly, it's impossible to prevent usage of those
> + * pages in the intel allocator.
> + *
> + * Instead, we test for those areas here and leak the corresponding pages.
> + *
> + * Some day, when the intel GPU memory is not backed by shmem any more,
> + * we'll be able to come up with a solution which is contained in i915.
> + */
> +static bool i915_usable_page(struct page *page)
> +{
> + dma_addr_t addr = page_to_phys(page);
> +
> + if 

i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

2012-03-19 Thread Hugh Dickins
On Sun, 18 Mar 2012, Rafael J. Wysocki wrote:
> On Sunday, March 18, 2012, Hugh Dickins wrote:
> > On Sat, 17 Mar 2012, Keith Packard wrote:
> > > On Sat, 17 Mar 2012 18:44:18 -0700 (PDT), Hugh Dickins  > > google.com> wrote:
> > > > I keep worrying about the sequence when the machine is powered on again
> > > > after hibernation: can i915 get up to anything before it is resumed from
> > > > the hibernation image?
...
> 
> Well, pretty much the only explanation of the observed symptoms I can imagine
> is some kind of "leakage" of the boot kernel's memory mappings through the
> graphics adapter into the post-restore system.

Thanks for your detailed confirmations, Rafael.  I see from other threads
that pretty much everyone has now converged upon this hypothesis - which
perhaps I read first elsewhere, forgot, and then thought it my idea!
I'll leave it to the i915 experts: good luck in tracking it down.

Hugh


i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

2012-03-19 Thread Hugh Dickins
On Sat, 17 Mar 2012, Linus Torvalds wrote:
> On Sat, Mar 17, 2012 at 3:09 PM, Hugh Dickins  wrote:
> >
> > I've got to go out for an hour: I'll digest more and think more about
> > this when I get back. ?If someone could explain the original problem
> > with _MOVABLE, that would help me:
> 
> I do not believe we actually ever uncovered the original problem with
> _MOVABLE: the problem was bisected down to the stable-backported
> version of commit 4bdadb978569 ("drm/i915: Selectively enable
> self-reclaim"), and I looked at the changes and decided that one of
> the main ones was the removal of the mapping_set_gfp_mask() - which
> resulted in __GFP_MOVABLE being on for the mapping.
> 
> So we just tried re-introducing that, and it fixed the problem.
> 
> Exactly *why* movable pages are a problem was never all that clear.
> ...

Thanks a lot for your considered and detailed reply on this
(before we added Rafael).

I've come to the (not entirely firm) conclusion that the addition and
removal of __GFP_MOVABLE was just shifting the shmem object allocations
from one block of memory (shared with others not using __GFP_MOVABLE)
to another (shared with others using __GFP_MOVABLE) and back.

So when the __GFP_MOVABLE inadvertently went in, a new group of users
who hadn't noticed the corruption before, now reported it; and when you
removed the __GFP_MOVABLE (a good idea anyway, pinned pages just clogging
up an otherwise movable block of memory), that group of users found their
problem solved.

Not really a problem from i915 specifying __GFP_MOVABLE on shmem objects.

Hugh


Re: i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

2012-03-19 Thread Hugh Dickins
On Sat, 17 Mar 2012, Linus Torvalds wrote:
 On Sat, Mar 17, 2012 at 3:09 PM, Hugh Dickins hu...@google.com wrote:
 
  I've got to go out for an hour: I'll digest more and think more about
  this when I get back.  If someone could explain the original problem
  with _MOVABLE, that would help me:
 
 I do not believe we actually ever uncovered the original problem with
 _MOVABLE: the problem was bisected down to the stable-backported
 version of commit 4bdadb978569 (drm/i915: Selectively enable
 self-reclaim), and I looked at the changes and decided that one of
 the main ones was the removal of the mapping_set_gfp_mask() - which
 resulted in __GFP_MOVABLE being on for the mapping.
 
 So we just tried re-introducing that, and it fixed the problem.
 
 Exactly *why* movable pages are a problem was never all that clear.
 ...

Thanks a lot for your considered and detailed reply on this
(before we added Rafael).

I've come to the (not entirely firm) conclusion that the addition and
removal of __GFP_MOVABLE was just shifting the shmem object allocations
from one block of memory (shared with others not using __GFP_MOVABLE)
to another (shared with others using __GFP_MOVABLE) and back.

So when the __GFP_MOVABLE inadvertently went in, a new group of users
who hadn't noticed the corruption before, now reported it; and when you
removed the __GFP_MOVABLE (a good idea anyway, pinned pages just clogging
up an otherwise movable block of memory), that group of users found their
problem solved.

Not really a problem from i915 specifying __GFP_MOVABLE on shmem objects.

Hugh___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

2012-03-19 Thread Hugh Dickins
On Sun, 18 Mar 2012, Rafael J. Wysocki wrote:
 On Sunday, March 18, 2012, Hugh Dickins wrote:
  On Sat, 17 Mar 2012, Keith Packard wrote:
   On Sat, 17 Mar 2012 18:44:18 -0700 (PDT), Hugh Dickins hu...@google.com 
   wrote:
I keep worrying about the sequence when the machine is powered on again
after hibernation: can i915 get up to anything before it is resumed from
the hibernation image?
...
 
 Well, pretty much the only explanation of the observed symptoms I can imagine
 is some kind of leakage of the boot kernel's memory mappings through the
 graphics adapter into the post-restore system.

Thanks for your detailed confirmations, Rafael.  I see from other threads
that pretty much everyone has now converged upon this hypothesis - which
perhaps I read first elsewhere, forgot, and then thought it my idea!
I'll leave it to the i915 experts: good luck in tracking it down.

Hugh
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

2012-03-18 Thread Hugh Dickins
On Sat, 17 Mar 2012, Linus Torvalds wrote:
 On Sat, Mar 17, 2012 at 1:23 PM, Dave Airlie airl...@gmail.com wrote:
 
  I did however get a flashback in google to this:
 
  http://lists.fedoraproject.org/pipermail/scm-commits/2010-July/456636.html
 
  Linus don't think we ever did work out why that worked, I wonder if we
  lost something after that.
 
 Hmm. Maybe we should stop making the default gfp_mask be
 GFP_HIGHMEM_MOVABLE (see inode_init_always() in fs/inode.c), and
 instead add the _MOVABLE flag only for mappings that use
 generic_file_mmap().
 
 It does sound a bit scary to just make default mappings use MOVABLE
 pages, when we know that can be incorrect for some cases.
 
 But that would require that filesystems like ext4 etc that don't just
 use the generic_file_mmap() function would have add do that MOVABLE
 thing on their own. And the *normal* case is certainly that pages are
 movable and putting them in themovable zone should be ok. It's just
 *not* ok if you also map them into some hardware GTT thing or just
 otherwise keep track of the pages directly, like DRI does.
 
 The other possibility is to just make this a shm thing, since it's
 generally that layer that has the case of pages allocated with the
 mapping gfp masks. Hugh Dickins clearly tried to make sure that the
 DRM initialization of the gfp mask was honored, but maybe there is
 some case where it is missed. Hugh added a mapping_set_gfp_mask() to
 i915_gem_alloc_object(), but maybe we have i915 shmem mappings that
 get set up through some other path.
 
 What about the ttm swap storage thing, for example? That also does
 shmem_file_setup(), without limiting the pages. But I don't think i915
 uses ttm, right?
 
 Al? Hugh? Opinions? Right now anybody who uses the
 shmem_read_mapping_page() function is in danger of getting a MOVABLE
 page by default. Which may or may not be right.
 
 That said, I don't see how i915 would get a movable page. It *seems*
 to do the right setup for the gfp flags of the mapping.

Yes, i915_gem_alloc_object() does its own mapping_set_gfp_mask(mapping,
GFP_HIGHUSER | __GFP_RECLAIMABLE), which should be as good as ever.

I've got to go out for an hour: I'll digest more and think more about
this when I get back.  If someone could explain the original problem
with _MOVABLE, that would help me: I didn't see an explanation in the
patch or in the bugzilla.  I would expect using _MOVABLE for i915_gem
would render a block of otherwise movable pages immovable because of
the GEM pages pinned, but I wouldn't expect them to move in mysterious
ways - or are references held to the pages without them being pinned??.

A factor which might turn out to be relevant: swapin_readahead() (or
swapoff) brings in pages from swap before it knows who they belong to,
so the swapped-in swapcache pages don't necessarily have the limitations
mapping_set_gfp_mask() has asked for.  It's been discussed with Alan,
we know it will need fixing for gma500 when eventually that comes to
be used on machines with more than 4GB, but for now I wasn't aware of
a problem.

Hugh
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

2012-03-18 Thread Hugh Dickins
On Sat, 17 Mar 2012, Keith Packard wrote:
 On Sat, 17 Mar 2012 15:52:15 -0700, Linus Torvalds 
 torva...@linux-foundation.org wrote:
 
  I do not believe we actually ever uncovered the original problem with
  _MOVABLE: the problem was bisected down to the stable-backported
  version of commit 4bdadb978569 (drm/i915: Selectively enable
  self-reclaim), and I looked at the changes and decided that one of
  the main ones was the removal of the mapping_set_gfp_mask() - which
  resulted in __GFP_MOVABLE being on for the mapping.
 
 Can anyone explain what __GFP_MOVABLE even does? I can't understand what
 this flag would be for; if the page is locked (with get_page), then the
 page cannot move. If it isn't locked, then it's subject to swapping, in
 which case the page will almost certainly move when it returns from
 disk. Is it that the page won't move if it isn't swapped? That doesn't
 seem all that useful to me.

__GFP_MOVABLE is a hint to page allocation that there's a good likelihood
that this logical page can be migrated elsewhere in physical memory later
on if mm wants, so it's a good idea to allocate it from a physical area of
similarly MOVABLE pages; then if later on someone wants a large contiguous
area for something (or wants to hot-unplug that memory), it should be easy
to clear the whole area out, moving existing pages elsewhere.  (I think
that's right: several questions come to me as I write it, but now is not
the time to research all those details.)  Page migration can only be done
later if it can account for all of page_count(page).

 
  but I didn't actually see why the i915 page pinning would be defeated
  by __GFP_MOVABLE. The code does get a reference to them afaik.
 
 GTT mapping and page locking are done in lock-step in the driver:
 
 i915_gem_object_bind_to_gtt
 i915_gem_object_get_pages_gtt
 pins the pages
 i915_gem_gtt_bind_object
 maps to GTT
 
 i915_gem_object_unbind
   i915_gem_gtt_unbind_object
 unmaps from GTT
   i915_gem_object_put_pages_gtt
 unpins the pages.
 
 There are no other code paths to unmapping objects from the GTT or
 unpinning the pages that I can find.
 
  So for example, i915_gem_object_get_pages_gtt() will use
  shmem_read_mapping_page_gfp() which will increment the page count for
  the page it gets, so all the obj-pages[] entries should have properly
  incremented page counts. And they get released by
  i915_gem_object_put_pages_gtt(), but maybe that is called too early
  while the pages are still in use by the GFX unit...
 
 This seems the most likely problem -- there are so many caches and
 buffers involved. However, we're seeing troubles on hibernate resume, at
 which point there isn't any acceleration going on, just two fbdev
 drivers poking the hardware. Which really reduces the complexity quite a
 bit; it's just CPU reads/writes through the GTT aperture created for the
 two console frame buffers. That makes this an interesting place to look
 for trouble; we can ignore vast areas within the driver that deal with
 acceleration, at least for this case.

I keep worrying about the sequence when the machine is powered on again
after hibernation: can i915 get up to anything before it is resumed from
the hibernation image?  Get to use certain pages at that stage, then
continue to poke at them after the hibernation image is restored (which
changes the story of what pages are free and what are used for what):
lacking some kind of flush?

Hugh
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

2012-03-18 Thread Hugh Dickins
Added Rafael to the Cc: Rafael, we're pondering over one or more of these
recurrent threads about corruption after resume, seemingly related to i915.

On Sat, 17 Mar 2012, Keith Packard wrote:
 On Sat, 17 Mar 2012 18:44:18 -0700 (PDT), Hugh Dickins hu...@google.com 
 wrote:
  I keep worrying about the sequence when the machine is powered on again
  after hibernation: can i915 get up to anything before it is resumed from
  the hibernation image?
 
 Well, the frame buffer is presumably still using whatever mapping it had
 before suspend occurred; is there any way it could be writing through
 that before the graphics driver was resumed?

It's hibernation restore here, so I don't think it could be using the
mapping from before hibernation until after resuming from hibernation
snapshot: it would be using the rebooting kernel's mapping until then.

 
 What I don't understand is the relationship between the boot kernel and
 the resumed kernel; when does the boot kernel stop writing to the
 console, and how does it hand off control of the frame buffer at that
 time.

I believe the handoff point comes in the late initcall software_resume():
which loads the image and calls hibernation_restore - resume_target_kernel
- swsusp_arch_resume, which emerges into the restored hibernation image.

As a late initcall, I imagine some work has already been done via the
framebuffer, but I have no conception of what kind of mappings that
involves (would shmem objects come into it at all? and is that even
a relevant question, could enough damage be done without them?), nor
whether they're properly torn down before emerging into the hibernimage.

 
 It would be great if we could separate out the boot kernel access to the
 graphics system from the resumed system -- if the boot kernel was run
 without the i915 driver loaded at all, and just used VGA text mode, then
 any damage as a result of resume wouldn't be caused by the boot kernel
 GTT mappings getting used at the wrong time.

But you're giving my worry more credence than it deserves there:
we don't have any evidence that this is where the problem lies,
that's just a suspicion of mine at the moment.

Hugh
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

2012-03-17 Thread Hugh Dickins
Added Rafael to the Cc: Rafael, we're pondering over one or more of these
recurrent threads about corruption after resume, seemingly related to i915.

On Sat, 17 Mar 2012, Keith Packard wrote:
> On Sat, 17 Mar 2012 18:44:18 -0700 (PDT), Hugh Dickins  
> wrote:
> > I keep worrying about the sequence when the machine is powered on again
> > after hibernation: can i915 get up to anything before it is resumed from
> > the hibernation image?
> 
> Well, the frame buffer is presumably still using whatever mapping it had
> before suspend occurred; is there any way it could be writing through
> that before the graphics driver was resumed?

It's hibernation restore here, so I don't think it could be using the
mapping from before hibernation until after resuming from hibernation
snapshot: it would be using the rebooting kernel's mapping until then.

> 
> What I don't understand is the relationship between the boot kernel and
> the resumed kernel; when does the boot kernel stop writing to the
> console, and how does it hand off control of the frame buffer at that
> time.

I believe the handoff point comes in the late initcall software_resume():
which loads the image and calls hibernation_restore -> resume_target_kernel
-> swsusp_arch_resume, which emerges into the restored hibernation image.

As a late initcall, I imagine some work has already been done via the
framebuffer, but I have no conception of what kind of mappings that
involves (would shmem objects come into it at all? and is that even
a relevant question, could enough damage be done without them?), nor
whether they're properly torn down before emerging into the hibernimage.

> 
> It would be great if we could separate out the boot kernel access to the
> graphics system from the resumed system -- if the boot kernel was run
> without the i915 driver loaded at all, and just used VGA text mode, then
> any damage as a result of resume wouldn't be caused by the boot kernel
> GTT mappings getting used at the wrong time.

But you're giving my worry more credence than it deserves there:
we don't have any evidence that this is where the problem lies,
that's just a suspicion of mine at the moment.

Hugh


i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

2012-03-17 Thread Hugh Dickins
On Sat, 17 Mar 2012, Keith Packard wrote:
> On Sat, 17 Mar 2012 15:52:15 -0700, Linus Torvalds  linux-foundation.org> wrote:
> 
> > I do not believe we actually ever uncovered the original problem with
> > _MOVABLE: the problem was bisected down to the stable-backported
> > version of commit 4bdadb978569 ("drm/i915: Selectively enable
> > self-reclaim"), and I looked at the changes and decided that one of
> > the main ones was the removal of the mapping_set_gfp_mask() - which
> > resulted in __GFP_MOVABLE being on for the mapping.
> 
> Can anyone explain what __GFP_MOVABLE even does? I can't understand what
> this flag would be for; if the page is locked (with get_page), then the
> page cannot move. If it isn't locked, then it's subject to swapping, in
> which case the page will almost certainly move when it returns from
> disk. Is it that the page won't move if it isn't swapped? That doesn't
> seem all that useful to me.

__GFP_MOVABLE is a hint to page allocation that there's a good likelihood
that this logical page can be migrated elsewhere in physical memory later
on if mm wants, so it's a good idea to allocate it from a physical area of
similarly MOVABLE pages; then if later on someone wants a large contiguous
area for something (or wants to hot-unplug that memory), it should be easy
to clear the whole area out, moving existing pages elsewhere.  (I think
that's right: several questions come to me as I write it, but now is not
the time to research all those details.)  Page migration can only be done
later if it can account for all of page_count(page).

> 
> > but I didn't actually see why the i915 page pinning would be defeated
> > by __GFP_MOVABLE. The code does get a reference to them afaik.
> 
> GTT mapping and page locking are done in lock-step in the driver:
> 
> i915_gem_object_bind_to_gtt
> i915_gem_object_get_pages_gtt
> pins the pages
> i915_gem_gtt_bind_object
> maps to GTT
> 
> i915_gem_object_unbind
>   i915_gem_gtt_unbind_object
> unmaps from GTT
>   i915_gem_object_put_pages_gtt
> unpins the pages.
> 
> There are no other code paths to unmapping objects from the GTT or
> unpinning the pages that I can find.
> 
> > So for example, i915_gem_object_get_pages_gtt() will use
> > shmem_read_mapping_page_gfp() which will increment the page count for
> > the page it gets, so all the obj->pages[] entries should have properly
> > incremented page counts. And they get released by
> > i915_gem_object_put_pages_gtt(), but maybe that is called too early
> > while the pages are still in use by the GFX unit...
> 
> This seems the most likely problem -- there are so many caches and
> buffers involved. However, we're seeing troubles on hibernate resume, at
> which point there isn't any acceleration going on, just two fbdev
> drivers poking the hardware. Which really reduces the complexity quite a
> bit; it's just CPU reads/writes through the GTT aperture created for the
> two console frame buffers. That makes this an interesting place to look
> for trouble; we can ignore vast areas within the driver that deal with
> acceleration, at least for this case.

I keep worrying about the sequence when the machine is powered on again
after hibernation: can i915 get up to anything before it is resumed from
the hibernation image?  Get to use certain pages at that stage, then
continue to poke at them after the hibernation image is restored (which
changes the story of what pages are free and what are used for what):
lacking some kind of flush?

Hugh


i915 modeset memory corruption issues? (Fwd: Oops in ext3_block_to_path.isra.40+0x26/0x11b)

2012-03-17 Thread Hugh Dickins
On Sat, 17 Mar 2012, Linus Torvalds wrote:
> On Sat, Mar 17, 2012 at 1:23 PM, Dave Airlie  wrote:
> >
> > I did however get a flashback in google to this:
> >
> > http://lists.fedoraproject.org/pipermail/scm-commits/2010-July/456636.html
> >
> > Linus don't think we ever did work out why that worked, I wonder if we
> > lost something after that.
> 
> Hmm. Maybe we should stop making the default gfp_mask be
> GFP_HIGHMEM_MOVABLE (see inode_init_always() in fs/inode.c), and
> instead add the _MOVABLE flag only for mappings that use
> "generic_file_mmap()".
> 
> It does sound a bit scary to just make default mappings use MOVABLE
> pages, when we know that can be incorrect for some cases.
> 
> But that would require that filesystems like ext4 etc that don't just
> use the generic_file_mmap() function would have add do that MOVABLE
> thing on their own. And the *normal* case is certainly that pages are
> movable and putting them in themovable zone should be ok. It's just
> *not* ok if you also map them into some hardware GTT thing or just
> otherwise keep track of the pages directly, like DRI does.
> 
> The other possibility is to just make this a shm thing, since it's
> generally that layer that has the case of "pages allocated with the
> mapping gfp masks". Hugh Dickins clearly tried to make sure that the
> DRM initialization of the gfp mask was honored, but maybe there is
> some case where it is missed. Hugh added a mapping_set_gfp_mask() to
> i915_gem_alloc_object(), but maybe we have i915 shmem mappings that
> get set up through some other path.
> 
> What about the ttm swap storage thing, for example? That also does
> shmem_file_setup(), without limiting the pages. But I don't think i915
> uses ttm, right?
> 
> Al? Hugh? Opinions? Right now anybody who uses the
> shmem_read_mapping_page() function is in danger of getting a MOVABLE
> page by default. Which may or may not be right.
> 
> That said, I don't see how i915 would get a movable page. It *seems*
> to do the right setup for the gfp flags of the mapping.

Yes, i915_gem_alloc_object() does its own mapping_set_gfp_mask(mapping,
GFP_HIGHUSER | __GFP_RECLAIMABLE), which should be as good as ever.

I've got to go out for an hour: I'll digest more and think more about
this when I get back.  If someone could explain the original problem
with _MOVABLE, that would help me: I didn't see an explanation in the
patch or in the bugzilla.  I would expect using _MOVABLE for i915_gem
would render a block of otherwise movable pages immovable because of
the GEM pages pinned, but I wouldn't expect them to move in mysterious
ways - or are references held to the pages without them being pinned??.

A factor which might turn out to be relevant: swapin_readahead() (or
swapoff) brings in pages from swap before it knows who they belong to,
so the swapped-in swapcache pages don't necessarily have the limitations
mapping_set_gfp_mask() has asked for.  It's been discussed with Alan,
we know it will need fixing for gma500 when eventually that comes to
be used on machines with more than 4GB, but for now I wasn't aware of
a problem.

Hugh