Re: [Intel-gfx] [PATCH v3] drm/i915: Refactor PAT/object cache handling

2023-07-21 Thread Matt Roper
On Thu, Jul 20, 2023 at 09:28:56PM -0700, Yang, Fei wrote:
> >>> [snip]
> > @@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> 
>  The code change here looks accurate, but while we're here, I have a
>  side question about this function in general...it was originally
>  introduced in commit 48004881f693 ("drm/i915: Mark CPU cache as
>  dirty when used for
>  rendering") which states that GPU rendering ends up in the CPU cache
>  (and thus needs a clflush later to make sure it lands in memory).
>  That makes sense to me for LLC platforms, but is it really true for
>  non-LLC snooping platforms (like MTL) as the commit states?
> >>>
> >>> For non-LLC platforms objects can be set to 1-way coherent, which
> >>> means GPU rendering ends up in the CPU cache as well, so for non-LLC
> >>> platforms the logic here should be checking the 1-way coherent flag.
> >>
> >> That's the part that I'm questioning (and not just for MTL, but for
> >> all of our other non-LLC platforms too).  Just because there's
> >> coherency doesn't mean that device writes landed in the CPU cache.
> >> Coherency is also achieved if device writes invalidate the contents of the 
> >> CPU cache.
> >> I thought our non-LLC snooping platforms were coherent due to
> >> write-invalidate rather than write-update, but I can't find it
> >> specifically documented anywhere at the moment.  If write-invalidate
> >> was used, then there shouldn't be a need for a later clflush either.
> >
> > [Trying to consolidate by doing a combined reply to the discussion so far.]
> >
> > On the write-invalidate vs write-update I don't know. If you did not
> > find it in bspec then I doubt I would. I can have a browse still.
> 
> Matt was correct. Quote Ron Silvas from SW ARCH, "MTL GPU doesn't write to
> CPU cache, it simply snoop CPU cache on its way to RAM."

That's good to know.  We probably also want to find out if the same is
true for old snooping platforms (i.e., things like VLV/CHV/BXT) and/or
dgpu's (where I think the behavior is probably defined by the pcie spec,
but I'm not sure where to look for that).

> 
>  My understanding
>  was that snooping platforms just invalidated the CPU cache to
>  prevent future CPU reads from seeing stale data but didn't actually
>  stick any new data in there?  Am I off track or is the original
>  logic of this function not quite right?
> 
>  Anyway, even if the logic of this function is wrong, it's a mistake
>  that would only hurt performance
> >>>
> >>> Yes, this logic will introduce performance impact because it's
> >>> missing the check for obj->pat_set_by_user. For objects with
> >>> pat_set_by_user==true, even if the object is snooping or 1-way
> >>> coherent, we don't want to enforce a clflush here since the
> >>> coherency is supposed to be handled by user space.
> >
> > What do you think I should add to fix it?
> 
> I think the simplest would be
> 
> > if (obj->pat_set_by_user)
> >         return false;
> 
> > because then even checking for incoherent WB is unnecessary; there is
> > simply no need for the KMD to initiate a flush if the PAT is set by user.

I guess one thing we're missing today is a well-documented explanation
of the expectations for i915_gem_set_domain_ioctl behavior when we're
using a user-defined PAT?


Matt

> 
> > Add a check for non-coherent WB in gpu_write_needs_clflush as an additional 
> > condition for returning false?
> >
> > And then, if Matt is correct that write-invalidate is used, should
> > !HAS_LLC also just return false?
> >
>  (flushing more often than we truly need to) rather than
>  functionality, so not something we really need to dig into right now
>  as part of this patch.
> 
> >   if (IS_DGFX(i915))
> >   return false;
> >
> > -/*
> > - * For objects created by userspace through GEM_CREATE with pat_index
> > - * set by set_pat extension, i915_gem_object_has_cache_level() will
> > - * always return true, because the coherency of such object is managed
> > - * by userspace. Othereise the call here would fall back to checking
> > - * whether the object is un-cached or write-through.
> > - */
> > -return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
> > - i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
> > +return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
> > +   i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
> >   }
> >>>
> >>> [snip]
> > @@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
> >   if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
> >   return false;
> >
> > -/*
> > - * For objects created by userspace through GEM_CREATE with pat_index
> 

Re: [Intel-gfx] [PATCH v3] drm/i915: Refactor PAT/object cache handling

2023-07-21 Thread Tvrtko Ursulin



On 21/07/2023 05:28, Yang, Fei wrote:

[snip]

@@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)


The code change here looks accurate, but while we're here, I have a
side question about this function in general...it was originally
introduced in commit 48004881f693 ("drm/i915: Mark CPU cache as
dirty when used for
rendering") which states that GPU rendering ends up in the CPU cache
(and thus needs a clflush later to make sure it lands in memory).
That makes sense to me for LLC platforms, but is it really true for
non-LLC snooping platforms (like MTL) as the commit states?


For non-LLC platforms objects can be set to 1-way coherent, which
means GPU rendering ends up in the CPU cache as well, so for non-LLC
platforms the logic here should be checking the 1-way coherent flag.


That's the part that I'm questioning (and not just for MTL, but for
all of our other non-LLC platforms too).  Just because there's
coherency doesn't mean that device writes landed in the CPU cache.
Coherency is also achieved if device writes invalidate the contents of the CPU 
cache.
I thought our non-LLC snooping platforms were coherent due to
write-invalidate rather than write-update, but I can't find it
specifically documented anywhere at the moment.  If write-invalidate
was used, then there shouldn't be a need for a later clflush either.


[Trying to consolidate by doing a combined reply to the discussion so far.]

On the write-invalidate vs write-update I don't know. If you did not
find it in bspec then I doubt I would. I can have a browse still.


Matt was correct. Quote Ron Silvas from SW ARCH, "MTL GPU doesn't write to
CPU cache, it simply snoop CPU cache on its way to RAM."


Does it apply to all snooping platforms?

And for the cache level/mode based condition, how about replacing it with this:

/*
 * Fully coherent cached access may end up with data in the CPU cache
 * which hasn't hit memory yet.
 */
return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WB) &&
   i915_gem_object_has_cache_flag(obj, I915_CACHE_FLAG_COH2W);

?

Although that would mean old I915_CACHE_LLC on old platforms is actually 2-way 
coherent.

I am struggling to find a comprehensive explanation in bspec, but for instance 
605 makes it sound like it is fully coherent. Perhaps it really is and I 
should fix the legacy and Gen12 tables.

And if the write-invalidate applies to all snooping platforms then we extend it 
to:

/*
 * Fully coherent cached access may end up with data in the CPU cache
 * which hasn't hit memory yet.
 *
 * But not on snooping platforms, where it is impossible due to
 * write-invalidate.
 */
return !HAS_SNOOP(i915) &&
       (i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WB) &&
        i915_gem_object_has_cache_flag(obj, I915_CACHE_FLAG_COH2W));

That would prevent any flushing on MTL and make you happy from that aspect.

In fact, the snooping check could be before the cache mode check.
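Putting the pieces discussed so far together, a rough standalone sketch of the combined condition might look like the following. This is an illustration only: the flat boolean parameters and names are made-up stand-ins for the real i915 structures and helpers (the real function takes a drm_i915_gem_object):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the cache modes; not the real i915_cache_t. */
enum cache_mode { CACHE_MODE_UC, CACHE_MODE_WT, CACHE_MODE_WB };

/*
 * Sketch of the combined proposal: no KMD-initiated flush when userspace
 * manages coherency via a user-set PAT, none on dgpu or snooping
 * (write-invalidate) platforms, and otherwise only fully coherent cached
 * access can leave dirty data in the CPU cache.
 */
static bool gpu_write_needs_clflush(bool pat_set_by_user, bool is_dgfx,
                                    bool has_snoop, enum cache_mode mode,
                                    bool coh_2way)
{
	if (pat_set_by_user)
		return false;	/* coherency handled by userspace */

	if (is_dgfx || has_snoop)
		return false;	/* device writes never land in the CPU cache */

	/* Fully coherent cached access may hold data not yet in memory. */
	return mode == CACHE_MODE_WB && coh_2way;
}
```

With this ordering the user-PAT and snooping early-outs make the cache mode check the last resort, matching the suggestion that the snooping check could come before the cache mode check.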

For i915_gem_object_can_bypass_llc it would be ideal if a condition based on 
the absence of I915_BO_CACHE_COHERENT_FOR_WRITE would work. At least according 
to the kerneldoc for @cache_coherent:

 * I915_BO_CACHE_COHERENT_FOR_WRITE:
 *
 * When writing through the CPU cache, the GPU is still coherent. Note
 * that this also implies I915_BO_CACHE_COHERENT_FOR_READ.

So for objects without it set, we need to force a flush.

And make __i915_gem_object_update_coherency not set it for WB without 1-way 
coherency set.

According to bspec that would seem correct, because with 1-way snooping on MTL, 
the GPU snoops the IA until the first GPU access. So anything the CPU writes 
before the first GPU access would be coherent, and there is no need to flush in 
set pages. But if non-coherent WB is set then we need to flush.

I'll trybot it and see what happens.
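As a sketch of that idea, the condition reduces to a flag test. The flag names below are hypothetical stand-ins mirroring the @cache_coherent kerneldoc quoted above, not the actual driver code:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits mirroring the @cache_coherent kerneldoc. */
#define BO_ALLOC_USER               (1u << 0)
#define BO_CACHE_COHERENT_FOR_READ  (1u << 1)
#define BO_CACHE_COHERENT_FOR_WRITE (1u << 2)

/*
 * Sketch of the proposed i915_gem_object_can_bypass_llc() condition:
 * only userspace-allocated objects are considered, and the creation-time
 * flush is forced exactly when CPU writes are not coherent with the GPU's
 * view (COHERENT_FOR_WRITE absent).
 */
static bool can_bypass_llc(unsigned int obj_flags, unsigned int cache_coherent)
{
	if (!(obj_flags & BO_ALLOC_USER))
		return false;

	return !(cache_coherent & BO_CACHE_COHERENT_FOR_WRITE);
}
```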


My understanding
was that snooping platforms just invalidated the CPU cache to
prevent future CPU reads from seeing stale data but didn't actually
stick any new data in there?  Am I off track or is the original
logic of this function not quite right?

Anyway, even if the logic of this function is wrong, it's a mistake
that would only hurt performance


Yes, this logic will introduce performance impact because it's
missing the check for obj->pat_set_by_user. For objects with
pat_set_by_user==true, even if the object is snooping or 1-way
coherent, we don't want to enforce a clflush here since the
coherency is supposed to be handled by user space.


What do you think I should add to fix it?


I think the simplest would be

 if (obj->pat_set_by_user)
 return false;

because then even checking for incoherent WB is unnecessary; there is
simply no need for the KMD to initiate a flush if the PAT is set by user.


Add a check for non-coherent WB in gpu_write_needs_clflush as an additional 
condition for returning 

Re: [Intel-gfx] [PATCH v3] drm/i915: Refactor PAT/object cache handling

2023-07-20 Thread Yang, Fei
>>> [snip]
> @@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)

 The code change here looks accurate, but while we're here, I have a
 side question about this function in general...it was originally
 introduced in commit 48004881f693 ("drm/i915: Mark CPU cache as
 dirty when used for
 rendering") which states that GPU rendering ends up in the CPU cache
 (and thus needs a clflush later to make sure it lands in memory).
 That makes sense to me for LLC platforms, but is it really true for
 non-LLC snooping platforms (like MTL) as the commit states?
>>>
> >>> For non-LLC platforms objects can be set to 1-way coherent, which
> >>> means GPU rendering ends up in the CPU cache as well, so for non-LLC
> >>> platforms the logic here should be checking the 1-way coherent flag.
>>
>> That's the part that I'm questioning (and not just for MTL, but for
>> all of our other non-LLC platforms too).  Just because there's
>> coherency doesn't mean that device writes landed in the CPU cache.
>> Coherency is also achieved if device writes invalidate the contents of the 
>> CPU cache.
>> I thought our non-LLC snooping platforms were coherent due to
>> write-invalidate rather than write-update, but I can't find it
>> specifically documented anywhere at the moment.  If write-invalidate
>> was used, then there shouldn't be a need for a later clflush either.
>
> [Trying to consolidate by doing a combined reply to the discussion so far.]
>
> On the write-invalidate vs write-update I don't know. If you did not
> find it in bspec then I doubt I would. I can have a browse still.

Matt was correct. Quote Ron Silvas from SW ARCH, "MTL GPU doesn't write to
CPU cache, it simply snoop CPU cache on its way to RAM."

 My understanding
 was that snooping platforms just invalidated the CPU cache to
 prevent future CPU reads from seeing stale data but didn't actually
 stick any new data in there?  Am I off track or is the original
 logic of this function not quite right?

 Anyway, even if the logic of this function is wrong, it's a mistake
 that would only hurt performance
>>>
>>> Yes, this logic will introduce performance impact because it's
> >>> missing the check for obj->pat_set_by_user. For objects with
>>> pat_set_by_user==true, even if the object is snooping or 1-way
>>> coherent, we don't want to enforce a clflush here since the
>>> coherency is supposed to be handled by user space.
>
> What do you think I should add to fix it?

I think the simplest would be

if (obj->pat_set_by_user)
        return false;

because then even checking for incoherent WB is unnecessary; there is
simply no need for the KMD to initiate a flush if the PAT is set by user.

> Add a check for non-coherent WB in gpu_write_needs_clflush as an additional 
> condition for returning false?
>
> And then, if Matt is correct that write-invalidate is used, should
> !HAS_LLC also just return false?
>
 (flushing more often than we truly need to) rather than
 functionality, so not something we really need to dig into right now
 as part of this patch.

>   if (IS_DGFX(i915))
>   return false;
>
> -/*
> - * For objects created by userspace through GEM_CREATE with pat_index
> - * set by set_pat extension, i915_gem_object_has_cache_level() will
> - * always return true, because the coherency of such object is managed
> - * by userspace. Othereise the call here would fall back to checking
> - * whether the object is un-cached or write-through.
> - */
> -return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
> - i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
> +return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
> +   i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
>   }
>>>
>>> [snip]
> @@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
>   if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
>   return false;
>
> -/*
> - * For objects created by userspace through GEM_CREATE with pat_index
> - * set by set_pat extension, i915_gem_object_has_cache_level() always
> - * return true, otherwise the call would fall back to checking whether
> - * the object is un-cached.
> - */
>   return (cache->has_llc ||
>   obj->cache_dirty ||
> -!i915_gem_object_has_cache_level(obj, I915_CACHE_NONE));
> +i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1);

 Platforms with relocations and platforms with user-specified PAT
 have no overlap, right?  So a -1 return should be impossible here
 and this is one case where we could just treat the return value as
 a boolean, right?
>>>
>
> Hm no, or m

Re: [Intel-gfx] [PATCH v3] drm/i915: Refactor PAT/object cache handling

2023-07-20 Thread Tvrtko Ursulin



[Here let me just focus on the points which did not get further discussion in 
follow ups yet.]

On 19/07/2023 23:31, Matt Roper wrote:

On Wed, Jul 19, 2023 at 01:37:30PM +0100, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

Commit 9275277d5324 ("drm/i915: use pat_index instead of cache_level") has
introduced PAT indices to i915 internal APIs, partially replacing the
usage of driver internal cache_level, but has also added a few
questionable design decisions which this patch tries to improve upon.

Principal change is to invert the per platform cache level to PAT index
table which was added by the referenced commit, and by doing so enable
i915 to understand the cache mode between PAT indices, changing them from
opaque to transparent.

Once we have the inverted table we are able to remove the hidden false
"return true" from i915_gem_object_has_cache_level.

Other changes/fixes/improvements we are able to do:

1)
Replace the enum i915_cache_level with i915_cache_t, composed of a more
detailed representation of each cache mode (base mode plus flags).

For instance this way we are able to express the difference between WB and
1-way coherent WB on Meteorlake. Which in turn enables us to map the i915
"cached" mode to the correct Meteorlake PAT index.

2)
We can cache PAT indices of the caching modes used by the driver itself in
struct drm_i915_private, which eliminates the runtime calls to
i915_gem_get_pat_index from both high- and low-level i915 components.

3)
We can also cache the caching modes used by the driver for coherent
access and for display buffers.

4)
Remove the incorrect references to enum i915_cache_level from low level
PTE encode vfuncs, since those are actually given PAT indices by their
callers.

5)
Because i915 now understands PAT indices, we can remove the overly
aggressive flushing triggered from i915_gem_object_can_bypass_llc() and
limit it to non-coherent write-back mode only.

6)
Finally we are able to replace the platform dependent cache mode to string
code in debugfs and elsewhere by the single implementation based on
i915_cache_t.
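To illustrate the principal change (the inverted table and the tri-state query it enables), here is a minimal standalone sketch. The cache modes, PAT index values, and table size are made up for illustration, and the real i915_cache_t also carries coherency flags:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical simplified cache modes; made-up PAT values for illustration. */
enum cache_mode { MODE_UC, MODE_WT, MODE_WB, MODE_UNKNOWN };

#define MAX_PAT 8

/* Forward table in the style of commit 9275277d5324: cache mode -> PAT. */
static const int mode_to_pat[] = { [MODE_UC] = 3, [MODE_WT] = 2, [MODE_WB] = 0 };

/* Inverted table, built once, making PAT indices transparent to the driver. */
static enum cache_mode pat_to_mode[MAX_PAT];

static void invert_pat_table(void)
{
	int i, m;

	for (i = 0; i < MAX_PAT; i++)
		pat_to_mode[i] = MODE_UNKNOWN;	/* unpopulated PAT entries */
	for (m = MODE_UC; m <= MODE_WB; m++)
		pat_to_mode[mode_to_pat[m]] = (enum cache_mode)m;
}

/* Tri-state query: 1 = has the mode, 0 = does not, -1 = unknown PAT entry. */
static int has_cache_mode(int pat_index, enum cache_mode mode)
{
	static bool built;

	if (!built) {
		invert_pat_table();
		built = true;
	}
	if (pat_to_mode[pat_index] == MODE_UNKNOWN)
		return -1;
	return pat_to_mode[pat_index] == mode;
}
```

The -1 "unknown" return is the tri-state discussed later in the thread: a safe answer for PAT entries the table does not describe.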

v2:
  * Fix PAT-to-cache-mode table for PVC. (Fei)
  * Cache display caching mode too. (Fei)
  * Improve and document criteria in i915_gem_object_can_bypass_llc() (Matt)

v3:
  * Checkpatch issues.
  * Cache mode flags check fixed.

Signed-off-by: Tvrtko Ursulin 
Fixes: 9275277d5324 ("drm/i915: use pat_index instead of cache_level")
Cc: Chris Wilson 
Cc: Fei Yang 
Cc: Andi Shyti 
Cc: Matt Roper 
---
  drivers/gpu/drm/i915/Makefile |   1 +
  .../drm/i915/display/intel_plane_initial.c|   3 +-
  drivers/gpu/drm/i915/gem/i915_gem_domain.c|  56 ---
  drivers/gpu/drm/i915/gem/i915_gem_domain.h|   5 +-
  .../gpu/drm/i915/gem/i915_gem_execbuffer.c|  13 +-
  drivers/gpu/drm/i915/gem/i915_gem_internal.c  |   4 +-
  drivers/gpu/drm/i915/gem/i915_gem_mman.c  |  12 +-
  drivers/gpu/drm/i915/gem/i915_gem_object.c| 152 +++---
  drivers/gpu/drm/i915/gem/i915_gem_object.h|  11 +-
  .../gpu/drm/i915/gem/i915_gem_object_types.h  | 116 +
  drivers/gpu/drm/i915/gem/i915_gem_shmem.c |   8 +-
  drivers/gpu/drm/i915/gem/i915_gem_stolen.c|  11 +-
  drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c  |  44 ++---
  drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |   2 +-
  .../drm/i915/gem/selftests/huge_gem_object.c  |   4 +-
  .../gpu/drm/i915/gem/selftests/huge_pages.c   |   6 +-
  drivers/gpu/drm/i915/gt/gen6_ppgtt.c  |   4 +-
  drivers/gpu/drm/i915/gt/gen8_ppgtt.c  |  19 +--
  drivers/gpu/drm/i915/gt/intel_engine_cs.c |   2 +-
  drivers/gpu/drm/i915/gt/intel_ggtt.c  |  33 ++--
  drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c |   4 +-
  drivers/gpu/drm/i915/gt/intel_gtt.c   |   2 +-
  drivers/gpu/drm/i915/gt/intel_gtt.h   |   3 +-
  drivers/gpu/drm/i915/gt/intel_migrate.c   |  11 +-
  drivers/gpu/drm/i915/gt/intel_ppgtt.c |   6 +-
  .../gpu/drm/i915/gt/intel_ring_submission.c   |   4 +-
  drivers/gpu/drm/i915/gt/intel_timeline.c  |   2 +-
  drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   2 +-
  drivers/gpu/drm/i915/gt/selftest_migrate.c|   9 +-
  drivers/gpu/drm/i915/gt/selftest_reset.c  |  14 +-
  drivers/gpu/drm/i915/gt/selftest_tlb.c|   5 +-
  .../gpu/drm/i915/gt/selftest_workarounds.c|   2 +-
  drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c  |   8 +-
  drivers/gpu/drm/i915/i915_cache.c |  91 +++
  drivers/gpu/drm/i915/i915_cache.h |  60 +++
  drivers/gpu/drm/i915/i915_debugfs.c   |  53 +-
  drivers/gpu/drm/i915/i915_driver.c|   5 +
  drivers/gpu/drm/i915/i915_drv.h   |   5 +
  drivers/gpu/drm/i915/i915_gem.c   |  21 +--
  drivers/gpu/drm/i915/i915_gpu_error.c |   7 +-
  drivers/gpu/drm/i915/i915_pci.c   |  82 +-
  drivers/gpu/drm/i915/i915_perf.c  |   2 +-
  drivers/gpu/drm/i915/intel_device_info.h  |

Re: [Intel-gfx] [PATCH v3] drm/i915: Refactor PAT/object cache handling

2023-07-20 Thread Tvrtko Ursulin



On 20/07/2023 01:22, Matt Roper wrote:

On Wed, Jul 19, 2023 at 05:07:15PM -0700, Yang, Fei wrote:

[snip]

@@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)


The code change here looks accurate, but while we're here, I have a side
question about this function in general...it was originally introduced
in commit 48004881f693 ("drm/i915: Mark CPU cache as dirty when used for
rendering") which states that GPU rendering ends up in the CPU cache
(and thus needs a clflush later to make sure it lands in memory).  That
makes sense to me for LLC platforms, but is it really true for non-LLC
snooping platforms (like MTL) as the commit states?


For non-LLC platforms objects can be set to 1-way coherent, which means
GPU rendering ends up in the CPU cache as well, so for non-LLC platforms
the logic here should be checking the 1-way coherent flag.


That's the part that I'm questioning (and not just for MTL, but for all
of our other non-LLC platforms too).  Just because there's coherency
doesn't mean that device writes landed in the CPU cache.  Coherency is
also achieved if device writes invalidate the contents of the CPU cache.
I thought our non-LLC snooping platforms were coherent due to
write-invalidate rather than write-update, but I can't find it
specifically documented anywhere at the moment.  If write-invalidate was
used, then there shouldn't be a need for a later clflush either.


[Trying to consolidate by doing a combined reply to the discussion so far.]

On the write-invalidate vs write-update I don't know. If you did not 
find it in bspec then I doubt I would. I can have a browse still.



My understanding
was that snooping platforms just invalidated the CPU cache to prevent
future CPU reads from seeing stale data but didn't actually stick any
new data in there?  Am I off track or is the original logic of this
function not quite right?

Anyway, even if the logic of this function is wrong, it's a mistake that
would only hurt performance


Yes, this logic will introduce performance impact because it's missing the
check for obj->pat_set_by_user. For objects with pat_set_by_user==true,
even if the object is snooping or 1-way coherent, we don't want to enforce
a clflush here since the coherency is supposed to be handled by user space.


What do you think I should add to fix it?

Add a check for non-coherent WB in gpu_write_needs_clflush as an 
additional condition for returning false?


And then, if Matt is correct that write-invalidate is used, should
!HAS_LLC also just return false?



(flushing more often than we truly need to)
rather than functionality, so not something we really need to dig into
right now as part of this patch.


  if (IS_DGFX(i915))
  return false;

-/*
- * For objects created by userspace through GEM_CREATE with pat_index
- * set by set_pat extension, i915_gem_object_has_cache_level() will
- * always return true, because the coherency of such object is managed
- * by userspace. Othereise the call here would fall back to checking
- * whether the object is un-cached or write-through.
- */
-return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
- i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
+return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
+   i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
  }


[snip]

@@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
  if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
  return false;

-/*
- * For objects created by userspace through GEM_CREATE with pat_index
- * set by set_pat extension, i915_gem_object_has_cache_level() always
- * return true, otherwise the call would fall back to checking whether
- * the object is un-cached.
- */
  return (cache->has_llc ||
  obj->cache_dirty ||
-!i915_gem_object_has_cache_level(obj, I915_CACHE_NONE));
+i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1);


Platforms with relocations and platforms with user-specified PAT have no
overlap, right?  So a -1 return should be impossible here and this is
one case where we could just treat the return value as a boolean, right?




Hm no, or maybe. My thinking behind the tri-state is to allow a safe option 
for "don't know", in case the PAT index to cache mode table is not fully 
populated on some future platform.



My understanding is that the condition here means to say that, if GPU
access is uncached, don't use CPU reloc because the CPU cache might
contain stale data. This condition is sufficient for snooping platforms.
But from MTL onward, the condition should be whether the GPU access is
coherent with CPU. So, we should be checking 1-way coherent flag instead
of UC mode, because even if the GPU access is WB, it's still non-coherent,
thus CPU cache could be out-dated.
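For reference, the use_cpu_reloc() condition under discussion reduces to something like the following sketch. Flat parameters replace the real reloc_cache and object; has_uc_mode is the tri-state result of the hypothetical cache-mode query (1 = uncached, 0 = not uncached, -1 = unknown PAT entry):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * CPU relocations are fine on LLC platforms or for already-dirty objects;
 * otherwise they are avoided only when the object is known to be uncached
 * for GPU access, since the CPU cache could then hold stale data.
 */
static bool use_cpu_reloc(bool has_llc, bool cache_dirty, int has_uc_mode)
{
	return has_llc || cache_dirty || has_uc_mode != 1;
}
```

Note that an unknown (-1) entry permits a CPU relocation here, which is exactly the tri-state-as-boolean question being raised.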


Honestly the matrix of caching decision/logic i

Re: [Intel-gfx] [PATCH v3] drm/i915: Refactor PAT/object cache handling

2023-07-19 Thread Matt Roper
On Wed, Jul 19, 2023 at 05:07:15PM -0700, Yang, Fei wrote:
> [snip]
> >> @@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
> >
> > The code change here looks accurate, but while we're here, I have a side
> > question about this function in general...it was originally introduced
> > in commit 48004881f693 ("drm/i915: Mark CPU cache as dirty when used for
> > rendering") which states that GPU rendering ends up in the CPU cache
> > (and thus needs a clflush later to make sure it lands in memory).  That
> > makes sense to me for LLC platforms, but is it really true for non-LLC
> > snooping platforms (like MTL) as the commit states?
> 
> For non-LLC platforms objects can be set to 1-way coherent, which means
> GPU rendering ends up in the CPU cache as well, so for non-LLC platforms
> the logic here should be checking the 1-way coherent flag.

That's the part that I'm questioning (and not just for MTL, but for all
of our other non-LLC platforms too).  Just because there's coherency
doesn't mean that device writes landed in the CPU cache.  Coherency is
also achieved if device writes invalidate the contents of the CPU cache.
I thought our non-LLC snooping platforms were coherent due to
write-invalidate rather than write-update, but I can't find it
specifically documented anywhere at the moment.  If write-invalidate was
used, then there shouldn't be a need for a later clflush either.

> 
> > My understanding
> > was that snooping platforms just invalidated the CPU cache to prevent
> > future CPU reads from seeing stale data but didn't actually stick any
> > new data in there?  Am I off track or is the original logic of this
> > function not quite right?
> >
> > Anyway, even if the logic of this function is wrong, it's a mistake that
> > would only hurt performance
> 
> Yes, this logic will introduce performance impact because it's missing the
> check for obj->pat_set_by_user. For objects with pat_set_by_user==true,
> even if the object is snooping or 1-way coherent, we don't want to enforce
> a clflush here since the coherency is supposed to be handled by user space.
> 
> > (flushing more often than we truly need to)
> > rather than functionality, so not something we really need to dig into
> > right now as part of this patch.
> >
> >>  if (IS_DGFX(i915))
> >>  return false;
> >>
> >> -/*
> >> - * For objects created by userspace through GEM_CREATE with pat_index
> >> - * set by set_pat extension, i915_gem_object_has_cache_level() will
> >> - * always return true, because the coherency of such object is managed
> >> - * by userspace. Othereise the call here would fall back to checking
> >> - * whether the object is un-cached or write-through.
> >> - */
> >> -return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
> >> - i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
> >> +return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
> >> +   i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
> >>  }
> 
> [snip]
> >> @@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
> >>  if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
> >>  return false;
> >>
> >> -/*
> >> - * For objects created by userspace through GEM_CREATE with pat_index
> >> - * set by set_pat extension, i915_gem_object_has_cache_level() always
> >> - * return true, otherwise the call would fall back to checking whether
> >> - * the object is un-cached.
> >> - */
> >>  return (cache->has_llc ||
> >>  obj->cache_dirty ||
> >> -!i915_gem_object_has_cache_level(obj, I915_CACHE_NONE));
> >> +i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1);
> >
> > Platforms with relocations and platforms with user-specified PAT have no
> > overlap, right?  So a -1 return should be impossible here and this is
> > one case where we could just treat the return value as a boolean, right?
> 
> My understanding is that the condition here means to say that, if GPU
> access is uncached, don't use CPU reloc because the CPU cache might
> contain stale data. This condition is sufficient for snooping platforms.
> But from MTL onward, the condition should be whether the GPU access is
> coherent with CPU. So, we should be checking 1-way coherent flag instead
> of UC mode, because even if the GPU access is WB, it's still non-coherent,
> thus CPU cache could be out-dated.

My point is that this is relocation code --- it should be impossible to
get here on MTL and beyond, right?  So user-provided PAT isn't a
consideration.

> 
> [snip]
> >> @@ -208,12 +230,6 @@ bool i915_gem_object_can_bypass_llc(struct drm_i915_gem_object *obj)
> >>  if (!(obj->flags & I915_BO_ALLOC_USER))
> >>  return false;
> >>
> >> -/*
> >> - * Always flush cache for UMD objects at creation time.
> >> - */
> >> -if (ob

Re: [Intel-gfx] [PATCH v3] drm/i915: Refactor PAT/object cache handling

2023-07-19 Thread Yang, Fei
[snip]
>> @@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)
>
> The code change here looks accurate, but while we're here, I have a side
> question about this function in general...it was originally introduced
> in commit 48004881f693 ("drm/i915: Mark CPU cache as dirty when used for
> rendering") which states that GPU rendering ends up in the CPU cache
> (and thus needs a clflush later to make sure it lands in memory).  That
> makes sense to me for LLC platforms, but is it really true for non-LLC
> snooping platforms (like MTL) as the commit states?

For non-LLC platforms objects can be set to 1-way coherent, which means
GPU rendering ends up in the CPU cache as well, so for non-LLC platforms
the logic here should be checking the 1-way coherent flag.

> My understanding
> was that snooping platforms just invalidated the CPU cache to prevent
> future CPU reads from seeing stale data but didn't actually stick any
> new data in there?  Am I off track or is the original logic of this
> function not quite right?
>
> Anyway, even if the logic of this function is wrong, it's a mistake that
> would only hurt performance

Yes, this logic will introduce performance impact because it's missing the
check for obj->pat_set_by_user. For objects with pat_set_by_user==true,
even if the object is snooping or 1-way coherent, we don't want to enforce
a clflush here since the coherency is supposed to be handled by user space.

> (flushing more often than we truly need to)
> rather than functionality, so not something we really need to dig into
> right now as part of this patch.
>
>>  if (IS_DGFX(i915))
>>  return false;
>>
>> -/*
>> - * For objects created by userspace through GEM_CREATE with pat_index
>> - * set by set_pat extension, i915_gem_object_has_cache_level() will
>> - * always return true, because the coherency of such object is managed
>> - * by userspace. Othereise the call here would fall back to checking
>> - * whether the object is un-cached or write-through.
>> - */
>> -return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
>> - i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
>> +return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
>> +   i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
>>  }

[snip]
>> @@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
>>  if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
>>  return false;
>>
>> -/*
>> - * For objects created by userspace through GEM_CREATE with pat_index
>> - * set by set_pat extension, i915_gem_object_has_cache_level() always
>> - * return true, otherwise the call would fall back to checking whether
>> - * the object is un-cached.
>> - */
>>  return (cache->has_llc ||
>>  obj->cache_dirty ||
>> -!i915_gem_object_has_cache_level(obj, I915_CACHE_NONE));
>> +i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1);
>
> Platforms with relocations and platforms with user-specified PAT have no
> overlap, right?  So a -1 return should be impossible here and this is
> one case where we could just treat the return value as a boolean, right?

My understanding is that the condition here means: if GPU access is
uncached, don't use CPU relocations because the CPU cache might contain
stale data. That condition is sufficient for snooping platforms, but from
MTL onward the condition should be whether GPU access is coherent with the
CPU. So we should be checking the 1-way coherent flag instead of the UC
mode, because even if GPU access is WB it can still be non-coherent, and
thus the CPU cache could be outdated.

[snip]
>> @@ -208,12 +230,6 @@ bool i915_gem_object_can_bypass_llc(struct drm_i915_gem_object *obj)
>>  if (!(obj->flags & I915_BO_ALLOC_USER))
>>  return false;
>>
>> -/*
>> - * Always flush cache for UMD objects at creation time.
>> - */
>> -if (obj->pat_set_by_user)
>> -return true;
>> -

I'm still worried that the removal of these lines could cause the MESA
failure seen before. I know you are checking the PAT index below, but that
is only about GPU access; it doesn't tell you how the CPU is going to
access the memory. If user space sets an uncached PAT and the copy engine
is used to zero out the memory, but the CPU-side mapping is cacheable, you
could still see garbage data.

I agree these lines don't belong here, because they don't have anything to
do with LLC, but they need to be moved to the right location rather than
removed.

>>  /*
>>   * EHL and JSL add the 'Bypass LLC' MOCS entry, which should make it
>>   * possible for userspace to bypass the GTT caching bits set by the
>> @@ -226,7 +242,21 @@ bool i915_gem_object_can_bypass_llc(struct drm_i915_gem_object *obj)
>>   * it, but since i915 takes the stance of always zeroing me

Re: [Intel-gfx] [PATCH v3] drm/i915: Refactor PAT/object cache handling

2023-07-19 Thread Matt Roper
On Wed, Jul 19, 2023 at 01:37:30PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin 
> 
> Commit 9275277d5324 ("drm/i915: use pat_index instead of cache_level") has
> introduced PAT indices to i915 internal APIs, partially replacing the
> usage of driver internal cache_level, but has also added a few
> questionable design decisions which this patch tries to improve upon.
> 
> Principal change is to invert the per platform cache level to PAT index
> table which was added by the referenced commit, and by doing so enable
> i915 to understand the cache mode between PAT indices, changing them from
> opaque to transparent.
> 
> Once we have the inverted table we are able to remove the hidden false
> "return true" from i915_gem_object_has_cache_level.
> 
> Other changes/fixes/improvements we are able to do:
> 
> 1)
> Replace the enum i915_cache_level with i915_cache_t, composed of a more
> detailed representation of each cache mode (base mode plus flags).
> 
> For instance this way we are able to express the difference between WB and
> 1-way coherent WB on Meteorlake. Which in turn enables us to map the i915
> "cached" mode to the correct Meteorlake PAT index.
> 
> 2)
> We can cache PAT indices of the caching modes used by the driver itself in
> struct drm_i915_private, which eliminates the runtime calls to
> i915_gem_get_pat_index from both high- and low-level i915 components.
> 
> 3)
> We can also cache the caching modes used by the driver for coherent
> access and for display buffers.
> 
> 4)
> Remove the incorrect references to enum i915_cache_level from low level
> PTE encode vfuncs, since those are actually given PAT indices by their
> callers.
> 
> 5)
> Because i915 now understands PAT indices, we can remove the overly
> aggressive flushing triggered from i915_gem_object_can_bypass_llc() and
> limit it to non-coherent write-back mode only.
> 
> 6)
> Finally we are able to replace the platform dependent cache mode to string
> code in debugfs and elsewhere by the single implementation based on
> i915_cache_t.
> 
> v2:
>  * Fix PAT-to-cache-mode table for PVC. (Fei)
>  * Cache display caching mode too. (Fei)
>  * Improve and document criteria in i915_gem_object_can_bypass_llc() (Matt)
> 
> v3:
>  * Checkpatch issues.
>  * Cache mode flags check fixed.
> 
> Signed-off-by: Tvrtko Ursulin 
> Fixes: 9275277d5324 ("drm/i915: use pat_index instead of cache_level")
> Cc: Chris Wilson 
> Cc: Fei Yang 
> Cc: Andi Shyti 
> Cc: Matt Roper 
> ---
>  drivers/gpu/drm/i915/Makefile |   1 +
>  .../drm/i915/display/intel_plane_initial.c|   3 +-
>  drivers/gpu/drm/i915/gem/i915_gem_domain.c|  56 ---
>  drivers/gpu/drm/i915/gem/i915_gem_domain.h|   5 +-
>  .../gpu/drm/i915/gem/i915_gem_execbuffer.c|  13 +-
>  drivers/gpu/drm/i915/gem/i915_gem_internal.c  |   4 +-
>  drivers/gpu/drm/i915/gem/i915_gem_mman.c  |  12 +-
>  drivers/gpu/drm/i915/gem/i915_gem_object.c| 152 +++---
>  drivers/gpu/drm/i915/gem/i915_gem_object.h|  11 +-
>  .../gpu/drm/i915/gem/i915_gem_object_types.h  | 116 +
>  drivers/gpu/drm/i915/gem/i915_gem_shmem.c |   8 +-
>  drivers/gpu/drm/i915/gem/i915_gem_stolen.c|  11 +-
>  drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c  |  44 ++---
>  drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |   2 +-
>  .../drm/i915/gem/selftests/huge_gem_object.c  |   4 +-
>  .../gpu/drm/i915/gem/selftests/huge_pages.c   |   6 +-
>  drivers/gpu/drm/i915/gt/gen6_ppgtt.c  |   4 +-
>  drivers/gpu/drm/i915/gt/gen8_ppgtt.c  |  19 +--
>  drivers/gpu/drm/i915/gt/intel_engine_cs.c |   2 +-
>  drivers/gpu/drm/i915/gt/intel_ggtt.c  |  33 ++--
>  drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c |   4 +-
>  drivers/gpu/drm/i915/gt/intel_gtt.c   |   2 +-
>  drivers/gpu/drm/i915/gt/intel_gtt.h   |   3 +-
>  drivers/gpu/drm/i915/gt/intel_migrate.c   |  11 +-
>  drivers/gpu/drm/i915/gt/intel_ppgtt.c |   6 +-
>  .../gpu/drm/i915/gt/intel_ring_submission.c   |   4 +-
>  drivers/gpu/drm/i915/gt/intel_timeline.c  |   2 +-
>  drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   2 +-
>  drivers/gpu/drm/i915/gt/selftest_migrate.c|   9 +-
>  drivers/gpu/drm/i915/gt/selftest_reset.c  |  14 +-
>  drivers/gpu/drm/i915/gt/selftest_tlb.c|   5 +-
>  .../gpu/drm/i915/gt/selftest_workarounds.c|   2 +-
>  drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c  |   8 +-
>  drivers/gpu/drm/i915/i915_cache.c |  91 +++
>  drivers/gpu/drm/i915/i915_cache.h |  60 +++
>  drivers/gpu/drm/i915/i915_debugfs.c   |  53 +-
>  drivers/gpu/drm/i915/i915_driver.c|   5 +
>  drivers/gpu/drm/i915/i915_drv.h   |   5 +
>  drivers/gpu/drm/i915/i915_gem.c   |  21 +--
>  drivers/gpu/drm/i915/i915_gpu_error.c |   7 +-
>  drivers/gpu/drm/i915/i915_pci.c   |  82 +-
>  drivers/gpu/drm/i915/i915_perf.c  |   2 +-
>  drivers/gpu/drm/i915
