On Tue, Sep 19, 2023 at 10:26:42AM +0200, Oleksandr Natalenko wrote: > Andrzej asked me to try to revert commits 0b62af28f249, e0b72c14d8dc and > 1e0877d58b1e, and reverting those fixed the i915 crash for me. The > e0b72c14d8dc and 1e0877d58b1e commits look like just prerequisites, so I > assume 0b62af28f249 ("i915: convert shmem_sg_free_table() to use a > folio_batch") is the culprit here. > > Could you please check this? > > Our conversation with Andrzej is available at drm-intel GitLab [1]. > > Thanks. > > [1] https://gitlab.freedesktop.org/drm/intel/-/issues/9256
Wow, that is some great debugging. Thanks for all the time & effort you and others have invested. Sorry for breaking your system. You're almost right about the "prerequisites", but it's in the other direction; 0b62af28f249 is a prerequisite for the later two cleanups, so reverting all three is necessary to test 0b62af28f249. It seems to me that you've isolated the problem to constructing overly long sg lists. I didn't realise that was going to be a problem, so that's my fault. Could I ask you to try this patch? I'll follow up with another patch later because I think I made another assumption that may not be valid. diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c index 8f1633c3fb93..73a4a4eb29e0 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c @@ -100,6 +100,7 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, st->nents = 0; for (i = 0; i < page_count; i++) { struct folio *folio; + unsigned long nr_pages; const unsigned int shrink[] = { I915_SHRINK_BOUND | I915_SHRINK_UNBOUND, 0, @@ -150,6 +151,8 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, } } while (1); + nr_pages = min_t(unsigned long, + folio_nr_pages(folio), page_count - i); if (!i || sg->length >= max_segment || folio_pfn(folio) != next_pfn) { @@ -157,13 +160,13 @@ int shmem_sg_alloc_table(struct drm_i915_private *i915, struct sg_table *st, sg = sg_next(sg); st->nents++; - sg_set_folio(sg, folio, folio_size(folio), 0); + sg_set_folio(sg, folio, nr_pages * PAGE_SIZE, 0); } else { /* XXX: could overflow? */ - sg->length += folio_size(folio); + sg->length += nr_pages * PAGE_SIZE; } - next_pfn = folio_pfn(folio) + folio_nr_pages(folio); - i += folio_nr_pages(folio) - 1; + next_pfn = folio_pfn(folio) + nr_pages; + i += nr_pages - 1; /* Check that the i965g/gm workaround works. */ GEM_BUG_ON(gfp & __GFP_DMA32 && next_pfn >= 0x00100000UL);