Re: [PATCH v3] drm/i915: Refactor PAT/object cache handling
On Thu, Jul 20, 2023 at 09:28:56PM -0700, Yang, Fei wrote:
> >>> [snip]
> > @@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct
> > drm_i915_gem_object *obj)
> >
> > The code change here looks accurate, but while we're here, I have a
> > side question about this function in general...it was originally
> > introduced in commit 48004881f693 ("drm/i915: Mark CPU cache as
> > dirty when used for rendering") which states that GPU rendering
> > ends up in the CPU cache (and thus needs a clflush later to make
> > sure it lands in memory). That makes sense to me for LLC platforms,
> > but is it really true for non-LLC snooping platforms (like MTL) as
> > the commit states?
> >>>
> >>> For non-LLC platforms objects can be set to 1-way coherent which
> >>> means GPU rendering ending up in CPU cache as well, so for non-LLC
> >>> platform the logic here should be checking 1-way coherent flag.
> >>
> >> That's the part that I'm questioning (and not just for MTL, but for
> >> all of our other non-LLC platforms too). Just because there's
> >> coherency doesn't mean that device writes landed in the CPU cache.
> >> Coherency is also achieved if device writes invalidate the contents
> >> of the CPU cache. I thought our non-LLC snooping platforms were
> >> coherent due to write-invalidate rather than write-update, but I
> >> can't find it specifically documented anywhere at the moment. If
> >> write-invalidate was used, then there shouldn't be a need for a
> >> later clflush either.
> >
> > [Trying to consolidate by doing a combined reply to the discussion
> > so far.]
> >
> > On the write-invalidate vs write-update I don't know. If you did not
> > find it in bspec then I doubt I would. I can have a browse still.
>
> Matt was correct. Quote Ron Silvas from SW ARCH, "MTL GPU doesn't
> write to CPU cache, it simply snoop CPU cache on its way to RAM."

That's good to know. We probably also want to find out if the same is
true for old snooping platforms (i.e., things like VLV/CHV/BXT) and/or
dGPUs (where I think the behavior is probably defined by the PCIe spec,
but I'm not sure where to look for that).

> > My understanding was that snooping platforms just invalidated the
> > CPU cache to prevent future CPU reads from seeing stale data but
> > didn't actually stick any new data in there? Am I off track or is
> > the original logic of this function not quite right?
> >
> > Anyway, even if the logic of this function is wrong, it's a mistake
> > that would only hurt performance
> >>>
> >>> Yes, this logic will introduce performance impact because it's
> >>> missing the checking for obj->pat_set_by_user. For objects with
> >>> pat_set_by_user==true, even if the object is snooping or 1-way
> >>> coherent, we don't want to enforce a clflush here since the
> >>> coherency is supposed to be handled by user space.
> >
> > What do you think I should add to fix it?
>
> I think the simplest would be
>
> 	if (obj->pat_set_by_user)
> 		return false;
>
> because even checking for incoherent WB is unnecessary, simply no
> need for the KMD to initiate a flush if PAT is set by user.

I guess one thing we're missing today is a well-documented explanation
of the expectations for i915_gem_set_domain_ioctl behavior when we're
using a user-defined PAT?

Matt

> > Add a check for non-coherent WB in gpu_write_needs_clflush as an
> > additional condition for returning false?
> >
> > And then if Matt is correct write-invalidate is used also !HAS_LLC
> > should just return false?
> > (flushing more often than we truly need to) rather than
> > functionality, so not something we really need to dig into right
> > now as part of this patch.
> >
> > 	if (IS_DGFX(i915))
> > 		return false;
> >
> > -	/*
> > -	 * For objects created by userspace through GEM_CREATE with pat_index
> > -	 * set by set_pat extension, i915_gem_object_has_cache_level() will
> > -	 * always return true, because the coherency of such object is managed
> > -	 * by userspace. Othereise the call here would fall back to checking
> > -	 * whether the object is un-cached or write-through.
> > -	 */
> > -	return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
> > -		 i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
> > +	return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
> > +	       i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
> > }
> >>>
> >>> [snip]
> > @@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct
> > reloc_cache *cache,
> > 	if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
> > 		return false;
> >
> > -	/*
> > -	 * For objects created by userspace through GEM_CREATE with pat_index
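Fei's suggested early return above slots in ahead of the cache-mode checks from the diff. A minimal compilable sketch of that ordering, using stand-in types invented for illustration (the real driver works on struct drm_i915_gem_object and i915_gem_object_has_cache_mode()):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the driver's object state. */
enum cache_mode { CACHE_MODE_UC, CACHE_MODE_WT, CACHE_MODE_WB };

struct gem_object_model {
	bool pat_set_by_user;	/* PAT index chosen via the set_pat extension */
	enum cache_mode mode;	/* effective GPU caching mode */
};

static bool gpu_write_needs_clflush_sketch(const struct gem_object_model *obj)
{
	/* Userspace owns coherency when it picked the PAT itself. */
	if (obj->pat_set_by_user)
		return false;

	/* Otherwise flush unless the mode already bypasses the CPU cache. */
	return obj->mode != CACHE_MODE_UC && obj->mode != CACHE_MODE_WT;
}
```

The point of putting the pat_set_by_user check first is that no other query matters once the KMD has ceded coherency management to userspace.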
Re: [PATCH v3] drm/i915: Refactor PAT/object cache handling
On 21/07/2023 05:28, Yang, Fei wrote:

[snip]

@@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)

The code change here looks accurate, but while we're here, I have a side question about this function in general...it was originally introduced in commit 48004881f693 ("drm/i915: Mark CPU cache as dirty when used for rendering") which states that GPU rendering ends up in the CPU cache (and thus needs a clflush later to make sure it lands in memory). That makes sense to me for LLC platforms, but is it really true for non-LLC snooping platforms (like MTL) as the commit states?

For non-LLC platforms objects can be set to 1-way coherent which means GPU rendering ending up in CPU cache as well, so for non-LLC platform the logic here should be checking 1-way coherent flag.

That's the part that I'm questioning (and not just for MTL, but for all of our other non-LLC platforms too). Just because there's coherency doesn't mean that device writes landed in the CPU cache. Coherency is also achieved if device writes invalidate the contents of the CPU cache. I thought our non-LLC snooping platforms were coherent due to write-invalidate rather than write-update, but I can't find it specifically documented anywhere at the moment. If write-invalidate was used, then there shouldn't be a need for a later clflush either.

[Trying to consolidate by doing a combined reply to the discussion so far.]

On the write-invalidate vs write-update I don't know. If you did not find it in bspec then I doubt I would. I can have a browse still.

Matt was correct. Quote Ron Silvas from SW ARCH, "MTL GPU doesn't write to CPU cache, it simply snoop CPU cache on its way to RAM."

Does it apply to all snooping platforms? And for the cache level/mode based condition, how about replacing it with this:

	/*
	 * Fully coherent cached access may end up with data in the CPU cache
	 * which hasn't hit memory yet.
	 */
	return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WB) &&
	       i915_gem_object_has_cache_flag(obj, I915_CACHE_FLAG_COH2W);

?

Although that would mean old I915_CACHE_LLC on old platforms is actually 2-way coherent. I am struggling to find a comprehensive explanation in bspec, but for instance 605 makes it sound like it is fully coherent. Perhaps it really is and I should fix the legacy and Gen12 table..

And if the write-invalidate applies to all snooping platforms then we extend it to:

	/*
	 * Fully coherent cached access may end up with data in the CPU cache
	 * which hasn't hit memory yet.
	 *
	 * But not on snooping platforms, where it is impossible due to
	 * write-invalidate.
	 */
	return !HAS_SNOOP(i915) &&
	       (i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WB) &&
	        i915_gem_object_has_cache_flag(obj, I915_CACHE_FLAG_COH2W));

That would prevent any flushing on MTL and make you happy from that aspect. In fact, the snooping check could be before the cache mode check.

For i915_gem_object_can_bypass_llc it would be ideal if a condition based on the absence of I915_BO_CACHE_COHERENT_FOR_WRITE would work. At least according to the kerneldoc for @cache_coherent:

	 * I915_BO_CACHE_COHERENT_FOR_WRITE:
	 *
	 * When writing through the CPU cache, the GPU is still coherent. Note
	 * that this also implies I915_BO_CACHE_COHERENT_FOR_READ.

So for objects without it set, we need to force a flush. And make __i915_gem_object_update_coherency not set it for WB without 1-way coherency set.

According to bspec that would seem correct, because with 1-way snooping on MTL, GPU snoops the IA until first GPU access. So anything the CPU writes before the first GPU access would be coherent, and so no need to flush in set pages. But if non-coherent WB is set then we need to flush.

I'll trybot it and see what happens.

My understanding was that snooping platforms just invalidated the CPU cache to prevent future CPU reads from seeing stale data but didn't actually stick any new data in there? Am I off track or is the original logic of this function not quite right?

Anyway, even if the logic of this function is wrong, it's a mistake that would only hurt performance

Yes, this logic will introduce performance impact because it's missing the checking for obj->pat_set_by_user. For objects with pat_set_by_user==true, even if the object is snooping or 1-way coherent, we don't want to enforce a clflush here since the coherency is supposed to be handled by user space.

What do you think I should add to fix it?

I think the simplest would be

	if (obj->pat_set_by_user)
		return false;

because even checking for incoherent WB is unnecessary, simply no need for the KMD to initiate a flush if PAT is set by user.

Add a check for non-coherent WB in gpu_write_needs_clflush as an additional condition for returning
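The snoop-aware variant floated above ("!HAS_SNOOP && WB && 2-way coherent") can be modeled as a small boolean function. Everything here is a simplified stand-in (HAS_SNOOP and the mode/flag queries become plain fields); the real helpers are i915_gem_object_has_cache_mode() and i915_gem_object_has_cache_flag():

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for platform and object state. */
struct platform_model { bool has_snoop; };
struct object_model { bool cached_wb; bool coh_2way; };

static bool needs_clflush_snoop_sketch(const struct platform_model *i915,
				       const struct object_model *obj)
{
	/*
	 * Fully coherent cached access may end up with data in the CPU
	 * cache which hasn't hit memory yet - but not on snooping
	 * platforms, if device writes there use write-invalidate.
	 */
	return !i915->has_snoop && obj->cached_wb && obj->coh_2way;
}
```

Evaluating the cheap platform check first (the "snooping check could be before the cache mode check" remark) short-circuits the per-object queries on MTL-like platforms.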
RE: [PATCH v3] drm/i915: Refactor PAT/object cache handling
>>> [snip]
> @@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct
> drm_i915_gem_object *obj)

The code change here looks accurate, but while we're here, I have a side question about this function in general...it was originally introduced in commit 48004881f693 ("drm/i915: Mark CPU cache as dirty when used for rendering") which states that GPU rendering ends up in the CPU cache (and thus needs a clflush later to make sure it lands in memory). That makes sense to me for LLC platforms, but is it really true for non-LLC snooping platforms (like MTL) as the commit states?
>>>
>>> For non-LLC platforms objects can be set to 1-way coherent which
>>> means GPU rendering ending up in CPU cache as well, so for non-LLC
>>> platform the logic here should be checking 1-way coherent flag.
>>
>> That's the part that I'm questioning (and not just for MTL, but for
>> all of our other non-LLC platforms too). Just because there's
>> coherency doesn't mean that device writes landed in the CPU cache.
>> Coherency is also achieved if device writes invalidate the contents of the
>> CPU cache.
>> I thought our non-LLC snooping platforms were coherent due to
>> write-invalidate rather than write-update, but I can't find it
>> specifically documented anywhere at the moment. If write-invalidate
>> was used, then there shouldn't be a need for a later clflush either.
>
> [Trying to consolidate by doing a combined reply to the discussion so far.]
>
> On the write-invalidate vs write-update I don't know. If you did not
> find it in bspec then I doubt I would. I can have a browse still.

Matt was correct. Quote Ron Silvas from SW ARCH, "MTL GPU doesn't write to CPU cache, it simply snoop CPU cache on its way to RAM."

My understanding was that snooping platforms just invalidated the CPU cache to prevent future CPU reads from seeing stale data but didn't actually stick any new data in there? Am I off track or is the original logic of this function not quite right?

Anyway, even if the logic of this function is wrong, it's a mistake that would only hurt performance
>>>
>>> Yes, this logic will introduce performance impact because it's
>>> missing the checking for obj->pat_set_by_user. For objects with
>>> pat_set_by_user==true, even if the object is snooping or 1-way
>>> coherent, we don't want to enforce a clflush here since the
>>> coherency is supposed to be handled by user space.
>
> What do you think I should add to fix it?

I think the simplest would be

	if (obj->pat_set_by_user)
		return false;

because even checking for incoherent WB is unnecessary, simply no need for the KMD to initiate a flush if PAT is set by user.

> Add a check for non-coherent WB in gpu_write_needs_clflush as an additional
> condition for returning false?
>
> And then if Matt is correct write-invalidate is used also !HAS_LLC should
> just return false?
> (flushing more often than we truly need to) rather than
> functionality, so not something we really need to dig into right now
> as part of this patch.
>
> 	if (IS_DGFX(i915))
> 		return false;
>
> -	/*
> -	 * For objects created by userspace through GEM_CREATE with pat_index
> -	 * set by set_pat extension, i915_gem_object_has_cache_level() will
> -	 * always return true, because the coherency of such object is managed
> -	 * by userspace. Othereise the call here would fall back to checking
> -	 * whether the object is un-cached or write-through.
> -	 */
> -	return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
> -		 i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
> +	return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
> +	       i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
> }
>>>
>>> [snip]
> @@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct
> reloc_cache *cache,
> 	if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
> 		return false;
>
> -	/*
> -	 * For objects created by userspace through GEM_CREATE with pat_index
> -	 * set by set_pat extension, i915_gem_object_has_cache_level() always
> -	 * return true, otherwise the call would fall back to checking whether
> -	 * the object is un-cached.
> -	 */
> 	return (cache->has_llc ||
> 		obj->cache_dirty ||
> -		!i915_gem_object_has_cache_level(obj, I915_CACHE_NONE));
> +		i915_gem_object_has_cache_mode(obj,
> +					       I915_CACHE_MODE_UC) != 1);

Platforms with relocations and platforms with user-specified PAT have no overlap, right? So a -1 return should be impossible here and this is one case where we could just treat the return value as a boolean, right?
>>>

Hm no, or m
Re: [PATCH v3] drm/i915: Refactor PAT/object cache handling
[Here let me just focus on the points which did not get further discussion in follow ups yet.]

On 19/07/2023 23:31, Matt Roper wrote:
On Wed, Jul 19, 2023 at 01:37:30PM +0100, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin

Commit 9275277d5324 ("drm/i915: use pat_index instead of cache_level") has introduced PAT indices to i915 internal APIs, partially replacing the usage of driver internal cache_level, but has also added a few questionable design decisions which this patch tries to improve upon.

Principal change is to invert the per platform cache level to PAT index table which was added by the referenced commit, and by doing so enable i915 to understand the cache mode between PAT indices, changing them from opaque to transparent.

Once we have the inverted table we are able to remove the hidden false "return true" from i915_gem_object_has_cache_level.

Other changes/fixes/improvements we are able to do:

1)
Replace the enum i915_cache_level with i915_cache_t, composed of a more detailed representation of each cache mode (base mode plus flags).

For instance this way we are able to express the difference between WB and 1-way coherent WB on Meteorlake. Which in turn enables us to map the i915 "cached" mode to the correct Meteorlake PAT index.

2)
We can cache PAT indices of the caching modes used by the driver itself in struct drm_i915_private, which eliminates the runtime calls to i915_gem_get_pat_index from both high- and low-level i915 components.

3)
We can also cache the caching modes used by the driver for coherent access and for display buffers.

4)
Remove the incorrect references to enum i915_cache_level from low level PTE encode vfuncs, since those are actually given PAT indices by their callers.

5)
Because i915 now understands PAT indices, we can remove the overly aggressive flushing triggered from i915_gem_object_can_bypass_llc() and limit it to non-coherent write-back mode only.

6)
Finally we are able to replace the platform dependent cache mode to string code in debugfs and elsewhere by the single implementation based on i915_cache_t.

v2:
 * Fix PAT-to-cache-mode table for PVC. (Fei)
 * Cache display caching mode too. (Fei)
 * Improve and document criteria in i915_gem_object_can_bypass_llc() (Matt)

v3:
 * Checkpatch issues.
 * Cache mode flags check fixed.

Signed-off-by: Tvrtko Ursulin
Fixes: 9275277d5324 ("drm/i915: use pat_index instead of cache_level")
Cc: Chris Wilson
Cc: Fei Yang
Cc: Andi Shyti
Cc: Matt Roper
---
 drivers/gpu/drm/i915/Makefile                 |   1 +
 .../drm/i915/display/intel_plane_initial.c    |   3 +-
 drivers/gpu/drm/i915/gem/i915_gem_domain.c    |  56 ---
 drivers/gpu/drm/i915/gem/i915_gem_domain.h    |   5 +-
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c    |  13 +-
 drivers/gpu/drm/i915/gem/i915_gem_internal.c  |   4 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c      |  12 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.c    | 152 +++---
 drivers/gpu/drm/i915/gem/i915_gem_object.h    |  11 +-
 .../gpu/drm/i915/gem/i915_gem_object_types.h  | 116 +
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c     |   8 +-
 drivers/gpu/drm/i915/gem/i915_gem_stolen.c    |  11 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c  |  44 ++---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |   2 +-
 .../drm/i915/gem/selftests/huge_gem_object.c  |   4 +-
 .../gpu/drm/i915/gem/selftests/huge_pages.c   |   6 +-
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c          |   4 +-
 drivers/gpu/drm/i915/gt/gen8_ppgtt.c          |  19 +--
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |   2 +-
 drivers/gpu/drm/i915/gt/intel_ggtt.c          |  33 ++--
 drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c     |   4 +-
 drivers/gpu/drm/i915/gt/intel_gtt.c           |   2 +-
 drivers/gpu/drm/i915/gt/intel_gtt.h           |   3 +-
 drivers/gpu/drm/i915/gt/intel_migrate.c       |  11 +-
 drivers/gpu/drm/i915/gt/intel_ppgtt.c         |   6 +-
 .../gpu/drm/i915/gt/intel_ring_submission.c   |   4 +-
 drivers/gpu/drm/i915/gt/intel_timeline.c      |   2 +-
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   2 +-
 drivers/gpu/drm/i915/gt/selftest_migrate.c    |   9 +-
 drivers/gpu/drm/i915/gt/selftest_reset.c      |  14 +-
 drivers/gpu/drm/i915/gt/selftest_tlb.c        |   5 +-
 .../gpu/drm/i915/gt/selftest_workarounds.c    |   2 +-
 drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c      |   8 +-
 drivers/gpu/drm/i915/i915_cache.c             |  91 +++
 drivers/gpu/drm/i915/i915_cache.h             |  60 +++
 drivers/gpu/drm/i915/i915_debugfs.c           |  53 +-
 drivers/gpu/drm/i915/i915_driver.c            |   5 +
 drivers/gpu/drm/i915/i915_drv.h               |   5 +
 drivers/gpu/drm/i915/i915_gem.c               |  21 +--
 drivers/gpu/drm/i915/i915_gpu_error.c         |   7 +-
 drivers/gpu/drm/i915/i915_pci.c               |  82 +-
 drivers/gpu/drm/i915/i915_perf.c              |   2 +-
 drivers/gpu/drm/i915/intel_device_info.h      |
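Point 1) of the cover letter describes i915_cache_t as a base cache mode plus coherency flags. One plausible encoding of that idea is a small packed integer; the bit layout below is invented for illustration (only the mode-plus-flags concept comes from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding: low byte = base mode, high bits = flags. */
typedef uint16_t i915_cache_sketch_t;

#define CACHE_MODE_MASK		0x00ffu
#define CACHE_MODE_UC		0u
#define CACHE_MODE_WT		1u
#define CACHE_MODE_WB		2u
#define CACHE_FLAG_COH1W	(1u << 8)	/* 1-way coherent */
#define CACHE_FLAG_COH2W	(1u << 9)	/* 2-way coherent */

static unsigned int cache_mode(i915_cache_sketch_t c)
{
	return c & CACHE_MODE_MASK;
}

static int has_flag(i915_cache_sketch_t c, unsigned int flag)
{
	return (c & flag) != 0;
}
```

With this shape, plain WB and "WB plus 1-way coherent" (the Meteorlake distinction the letter mentions) become different values that still compare equal on the base mode.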
Re: [PATCH v3] drm/i915: Refactor PAT/object cache handling
On 20/07/2023 01:22, Matt Roper wrote:
On Wed, Jul 19, 2023 at 05:07:15PM -0700, Yang, Fei wrote:

[snip]

@@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct drm_i915_gem_object *obj)

The code change here looks accurate, but while we're here, I have a side question about this function in general...it was originally introduced in commit 48004881f693 ("drm/i915: Mark CPU cache as dirty when used for rendering") which states that GPU rendering ends up in the CPU cache (and thus needs a clflush later to make sure it lands in memory). That makes sense to me for LLC platforms, but is it really true for non-LLC snooping platforms (like MTL) as the commit states?

For non-LLC platforms objects can be set to 1-way coherent which means GPU rendering ending up in CPU cache as well, so for non-LLC platform the logic here should be checking 1-way coherent flag.

That's the part that I'm questioning (and not just for MTL, but for all of our other non-LLC platforms too). Just because there's coherency doesn't mean that device writes landed in the CPU cache. Coherency is also achieved if device writes invalidate the contents of the CPU cache. I thought our non-LLC snooping platforms were coherent due to write-invalidate rather than write-update, but I can't find it specifically documented anywhere at the moment. If write-invalidate was used, then there shouldn't be a need for a later clflush either.

[Trying to consolidate by doing a combined reply to the discussion so far.]

On the write-invalidate vs write-update I don't know. If you did not find it in bspec then I doubt I would. I can have a browse still.

My understanding was that snooping platforms just invalidated the CPU cache to prevent future CPU reads from seeing stale data but didn't actually stick any new data in there? Am I off track or is the original logic of this function not quite right?

Anyway, even if the logic of this function is wrong, it's a mistake that would only hurt performance

Yes, this logic will introduce performance impact because it's missing the checking for obj->pat_set_by_user. For objects with pat_set_by_user==true, even if the object is snooping or 1-way coherent, we don't want to enforce a clflush here since the coherency is supposed to be handled by user space.

What do you think I should add to fix it?

Add a check for non-coherent WB in gpu_write_needs_clflush as an additional condition for returning false?

And then if Matt is correct write-invalidate is used also !HAS_LLC should just return false?

(flushing more often than we truly need to) rather than functionality, so not something we really need to dig into right now as part of this patch.

	if (IS_DGFX(i915))
		return false;

-	/*
-	 * For objects created by userspace through GEM_CREATE with pat_index
-	 * set by set_pat extension, i915_gem_object_has_cache_level() will
-	 * always return true, because the coherency of such object is managed
-	 * by userspace. Othereise the call here would fall back to checking
-	 * whether the object is un-cached or write-through.
-	 */
-	return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
-		 i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
+	return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
+	       i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
}

[snip]

@@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct reloc_cache *cache,
	if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
		return false;

-	/*
-	 * For objects created by userspace through GEM_CREATE with pat_index
-	 * set by set_pat extension, i915_gem_object_has_cache_level() always
-	 * return true, otherwise the call would fall back to checking whether
-	 * the object is un-cached.
-	 */
	return (cache->has_llc ||
		obj->cache_dirty ||
-		!i915_gem_object_has_cache_level(obj, I915_CACHE_NONE));
+		i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1);

Platforms with relocations and platforms with user-specified PAT have no overlap, right? So a -1 return should be impossible here and this is one case where we could just treat the return value as a boolean, right?

Hm no, or maybe. My thinking behind tri-state is to allow a safe option for "don't know". In case PAT index to cache mode table is not fully populated on some future platform.

My understanding is that the condition here means to say that, if GPU access is uncached, don't use CPU reloc because the CPU cache might contain stale data. This condition is sufficient for snooping platforms. But from MTL onward, the condition should be whether the GPU access is coherent with CPU. So, we should be checking 1-way coherent flag instead of UC mode, because even if the GPU access is WB, it's still non-coherent, thus CPU cache could be out-dated.

Honestly the matrix of caching decision/logic i
Re: [PATCH v3] drm/i915: Refactor PAT/object cache handling
On Wed, Jul 19, 2023 at 05:07:15PM -0700, Yang, Fei wrote:
> [snip]
> >> @@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct
> >> drm_i915_gem_object *obj)
> >
> > The code change here looks accurate, but while we're here, I have a side
> > question about this function in general...it was originally introduced
> > in commit 48004881f693 ("drm/i915: Mark CPU cache as dirty when used for
> > rendering") which states that GPU rendering ends up in the CPU cache
> > (and thus needs a clflush later to make sure it lands in memory). That
> > makes sense to me for LLC platforms, but is it really true for non-LLC
> > snooping platforms (like MTL) as the commit states?
>
> For non-LLC platforms objects can be set to 1-way coherent which means
> GPU rendering ending up in CPU cache as well, so for non-LLC platform
> the logic here should be checking 1-way coherent flag.

That's the part that I'm questioning (and not just for MTL, but for all
of our other non-LLC platforms too). Just because there's coherency
doesn't mean that device writes landed in the CPU cache. Coherency is
also achieved if device writes invalidate the contents of the CPU cache.
I thought our non-LLC snooping platforms were coherent due to
write-invalidate rather than write-update, but I can't find it
specifically documented anywhere at the moment. If write-invalidate was
used, then there shouldn't be a need for a later clflush either.

> > My understanding
> > was that snooping platforms just invalidated the CPU cache to prevent
> > future CPU reads from seeing stale data but didn't actually stick any
> > new data in there? Am I off track or is the original logic of this
> > function not quite right?
> >
> > Anyway, even if the logic of this function is wrong, it's a mistake that
> > would only hurt performance
>
> Yes, this logic will introduce performance impact because it's missing the
> checking for obj->pat_set_by_user. For objects with pat_set_by_user==true,
> even if the object is snooping or 1-way coherent, we don't want to enforce
> a clflush here since the coherency is supposed to be handled by user space.
>
> > (flushing more often than we truly need to)
> > rather than functionality, so not something we really need to dig into
> > right now as part of this patch.
> >
> >> 	if (IS_DGFX(i915))
> >> 		return false;
> >>
> >> -	/*
> >> -	 * For objects created by userspace through GEM_CREATE with pat_index
> >> -	 * set by set_pat extension, i915_gem_object_has_cache_level() will
> >> -	 * always return true, because the coherency of such object is managed
> >> -	 * by userspace. Othereise the call here would fall back to checking
> >> -	 * whether the object is un-cached or write-through.
> >> -	 */
> >> -	return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
> >> -		 i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
> >> +	return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
> >> +	       i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
> >> }
>
> [snip]
> >> @@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct
> >> reloc_cache *cache,
> >> 	if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
> >> 		return false;
> >>
> >> -	/*
> >> -	 * For objects created by userspace through GEM_CREATE with pat_index
> >> -	 * set by set_pat extension, i915_gem_object_has_cache_level() always
> >> -	 * return true, otherwise the call would fall back to checking whether
> >> -	 * the object is un-cached.
> >> -	 */
> >> 	return (cache->has_llc ||
> >> 		obj->cache_dirty ||
> >> -		!i915_gem_object_has_cache_level(obj, I915_CACHE_NONE));
> >> +		i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1);
> >
> > Platforms with relocations and platforms with user-specified PAT have no
> > overlap, right? So a -1 return should be impossible here and this is
> > one case where we could just treat the return value as a boolean, right?
>
> My understanding is that the condition here means to say that, if GPU
> access is uncached, don't use CPU reloc because the CPU cache might
> contain stale data. This condition is sufficient for snooping platforms.
> But from MTL onward, the condition should be whether the GPU access is
> coherent with CPU. So, we should be checking 1-way coherent flag instead
> of UC mode, because even if the GPU access is WB, it's still non-coherent,
> thus CPU cache could be out-dated.

My point is that this is relocation code --- it should be impossible to
get here on MTL and beyond, right? So user-provided PAT isn't a
consideration.

> [snip]
> >> @@ -208,12 +230,6 @@ bool i915_gem_object_can_bypass_llc(struct
> >> drm_i915_gem_object *obj)
> >> 	if (!(obj->flags & I915_BO_ALLOC_USER))
> >> 		return false;
> >>
> >> -	/*
> >> -	 * Always flush cache for UMD objects at creation time.
> >> -	 */
> >> -	if (ob
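The tri-state return being debated (and the reason the diff compares against `!= 1` instead of treating the result as a boolean) can be modeled directly: 1 means "definitely this mode", 0 means "definitely not", and -1 means the PAT-to-cache-mode table has no entry. The table below is made up for illustration; the real driver builds it per platform by inverting the PAT tables:

```c
#include <assert.h>

/* -1 marks a PAT index whose cache mode is unknown to the driver. */
#define MODE_UNKNOWN	(-1)

/* Hypothetical inverted PAT table: index -> cache mode (0=UC, 1=WT, 2=WB). */
static const int pat_to_mode[4] = { 0, 1, 2, MODE_UNKNOWN };

static int has_cache_mode_sketch(unsigned int pat_index, int mode)
{
	if (pat_to_mode[pat_index] == MODE_UNKNOWN)
		return -1;	/* caller picks the safe default */
	return pat_to_mode[pat_index] == mode;
}
```

Writing `has_cache_mode_sketch(pat, mode) != 1` then treats "unknown" the same as "not this mode", which is the conservative choice for flush decisions on a future platform whose table is not fully populated, as described above.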
RE: [PATCH v3] drm/i915: Refactor PAT/object cache handling
[snip]
>> @@ -27,15 +28,8 @@ static bool gpu_write_needs_clflush(struct
>> drm_i915_gem_object *obj)
>
> The code change here looks accurate, but while we're here, I have a side
> question about this function in general...it was originally introduced
> in commit 48004881f693 ("drm/i915: Mark CPU cache as dirty when used for
> rendering") which states that GPU rendering ends up in the CPU cache
> (and thus needs a clflush later to make sure it lands in memory). That
> makes sense to me for LLC platforms, but is it really true for non-LLC
> snooping platforms (like MTL) as the commit states?

For non-LLC platforms objects can be set to 1-way coherent which means
GPU rendering ending up in CPU cache as well, so for non-LLC platform
the logic here should be checking 1-way coherent flag.

> My understanding
> was that snooping platforms just invalidated the CPU cache to prevent
> future CPU reads from seeing stale data but didn't actually stick any
> new data in there? Am I off track or is the original logic of this
> function not quite right?
>
> Anyway, even if the logic of this function is wrong, it's a mistake that
> would only hurt performance

Yes, this logic will introduce performance impact because it's missing
the checking for obj->pat_set_by_user. For objects with
pat_set_by_user==true, even if the object is snooping or 1-way coherent,
we don't want to enforce a clflush here since the coherency is supposed
to be handled by user space.

> (flushing more often than we truly need to)
> rather than functionality, so not something we really need to dig into
> right now as part of this patch.
>
>> 	if (IS_DGFX(i915))
>> 		return false;
>>
>> -	/*
>> -	 * For objects created by userspace through GEM_CREATE with pat_index
>> -	 * set by set_pat extension, i915_gem_object_has_cache_level() will
>> -	 * always return true, because the coherency of such object is managed
>> -	 * by userspace. Othereise the call here would fall back to checking
>> -	 * whether the object is un-cached or write-through.
>> -	 */
>> -	return !(i915_gem_object_has_cache_level(obj, I915_CACHE_NONE) ||
>> -		 i915_gem_object_has_cache_level(obj, I915_CACHE_WT));
>> +	return i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1 &&
>> +	       i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_WT) != 1;
>> }

[snip]
>> @@ -640,15 +640,9 @@ static inline int use_cpu_reloc(const struct
>> reloc_cache *cache,
>> 	if (DBG_FORCE_RELOC == FORCE_GTT_RELOC)
>> 		return false;
>>
>> -	/*
>> -	 * For objects created by userspace through GEM_CREATE with pat_index
>> -	 * set by set_pat extension, i915_gem_object_has_cache_level() always
>> -	 * return true, otherwise the call would fall back to checking whether
>> -	 * the object is un-cached.
>> -	 */
>> 	return (cache->has_llc ||
>> 		obj->cache_dirty ||
>> -		!i915_gem_object_has_cache_level(obj, I915_CACHE_NONE));
>> +		i915_gem_object_has_cache_mode(obj, I915_CACHE_MODE_UC) != 1);
>
> Platforms with relocations and platforms with user-specified PAT have no
> overlap, right? So a -1 return should be impossible here and this is
> one case where we could just treat the return value as a boolean, right?

My understanding is that the condition here means to say that, if GPU
access is uncached, don't use CPU reloc because the CPU cache might
contain stale data. This condition is sufficient for snooping platforms.
But from MTL onward, the condition should be whether the GPU access is
coherent with CPU. So, we should be checking 1-way coherent flag instead
of UC mode, because even if the GPU access is WB, it's still
non-coherent, thus CPU cache could be out-dated.

[snip]
>> @@ -208,12 +230,6 @@ bool i915_gem_object_can_bypass_llc(struct
>> drm_i915_gem_object *obj)
>> 	if (!(obj->flags & I915_BO_ALLOC_USER))
>> 		return false;
>>
>> -	/*
>> -	 * Always flush cache for UMD objects at creation time.
>> -	 */
>> -	if (obj->pat_set_by_user)
>> -		return true;
>> -

I'm still worried that the removal of these lines would cause the MESA
failure seen before. I know you are checking pat index below, but that
is only about GPU access. It doesn't tell you how CPU is going to access
the memory. If user space sets an uncached PAT and then uses the copy
engine to zero out the memory, but on the CPU side the mapping is
cacheable, you could still see garbage data. I agree the lines don't
belong here because it doesn't have anything to do with LLC, but they
need to be moved to the right location instead of being removed.

>> 	/*
>> 	 * EHL and JSL add the 'Bypass LLC' MOCS entry, which should make it
>> 	 * possible for userspace to bypass the GTT caching bits set by the
>> @@ -226,7 +242,21 @@ bool i915_gem_object_can_bypass_llc(struct
>> drm_i915_gem_object *obj)
>> 	 * it, but since i915 takes the stance of always zeroing me
Re: [PATCH v3] drm/i915: Refactor PAT/object cache handling
On Wed, Jul 19, 2023 at 01:37:30PM +0100, Tvrtko Ursulin wrote:
> From: Tvrtko Ursulin
>
> Commit 9275277d5324 ("drm/i915: use pat_index instead of cache_level") has
> introduced PAT indices to i915 internal APIs, partially replacing the
> usage of driver internal cache_level, but has also added a few
> questionable design decisions which this patch tries to improve upon.
>
> The principal change is to invert the per-platform cache level to PAT index
> table which was added by the referenced commit, and by doing so enable
> i915 to understand the cache mode behind PAT indices, changing them from
> opaque to transparent.
>
> Once we have the inverted table we are able to remove the hidden false
> "return true" from i915_gem_object_has_cache_level.
>
> Other changes/fixes/improvements we are able to do:
>
> 1)
> Replace the enum i915_cache_level with i915_cache_t, composed of a more
> detailed representation of each cache mode (base mode plus flags).
>
> For instance this way we are able to express the difference between WB and
> 1-way coherent WB on Meteorlake, which in turn enables us to map the i915
> "cached" mode to the correct Meteorlake PAT index.
>
> 2)
> We can cache PAT indices of the caching modes used by the driver itself in
> struct drm_i915_private, which eliminates the runtime calls to
> i915_gem_get_pat_index from both high- and low-level i915 components.
>
> 3)
> We can also cache the caching modes used by the driver for coherent
> access and for display buffers.
>
> 4)
> Remove the incorrect references to enum i915_cache_level from low level
> PTE encode vfuncs, since those are actually given PAT indices by their
> callers.
>
> 5)
> Because i915 now understands PAT indices, we can remove the overly
> aggressive flushing triggered from i915_gem_object_can_bypass_llc() and
> limit it to non-coherent write-back mode only.
>
> 6)
> Finally we are able to replace the platform dependent cache mode to string
> code in debugfs and elsewhere by the single implementation based on
> i915_cache_t.
>
> v2:
>  * Fix PAT-to-cache-mode table for PVC. (Fei)
>  * Cache display caching mode too. (Fei)
>  * Improve and document criteria in i915_gem_object_can_bypass_llc() (Matt)
>
> v3:
>  * Checkpatch issues.
>  * Cache mode flags check fixed.
>
> Signed-off-by: Tvrtko Ursulin
> Fixes: 9275277d5324 ("drm/i915: use pat_index instead of cache_level")
> Cc: Chris Wilson
> Cc: Fei Yang
> Cc: Andi Shyti
> Cc: Matt Roper
> ---
> drivers/gpu/drm/i915/Makefile | 1 +
> .../drm/i915/display/intel_plane_initial.c | 3 +-
> drivers/gpu/drm/i915/gem/i915_gem_domain.c | 56 ---
> drivers/gpu/drm/i915/gem/i915_gem_domain.h | 5 +-
> .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 13 +-
> drivers/gpu/drm/i915/gem/i915_gem_internal.c | 4 +-
> drivers/gpu/drm/i915/gem/i915_gem_mman.c | 12 +-
> drivers/gpu/drm/i915/gem/i915_gem_object.c | 152 +++---
> drivers/gpu/drm/i915/gem/i915_gem_object.h | 11 +-
> .../gpu/drm/i915/gem/i915_gem_object_types.h | 116 +
> drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 8 +-
> drivers/gpu/drm/i915/gem/i915_gem_stolen.c | 11 +-
> drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 44 ++---
> drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 2 +-
> .../drm/i915/gem/selftests/huge_gem_object.c | 4 +-
> .../gpu/drm/i915/gem/selftests/huge_pages.c | 6 +-
> drivers/gpu/drm/i915/gt/gen6_ppgtt.c | 4 +-
> drivers/gpu/drm/i915/gt/gen8_ppgtt.c | 19 +--
> drivers/gpu/drm/i915/gt/intel_engine_cs.c | 2 +-
> drivers/gpu/drm/i915/gt/intel_ggtt.c | 33 ++--
> drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c | 4 +-
> drivers/gpu/drm/i915/gt/intel_gtt.c | 2 +-
> drivers/gpu/drm/i915/gt/intel_gtt.h | 3 +-
> drivers/gpu/drm/i915/gt/intel_migrate.c | 11 +-
> drivers/gpu/drm/i915/gt/intel_ppgtt.c | 6 +-
> .../gpu/drm/i915/gt/intel_ring_submission.c | 4 +-
> drivers/gpu/drm/i915/gt/intel_timeline.c | 2 +-
> drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 2 +-
> drivers/gpu/drm/i915/gt/selftest_migrate.c | 9 +-
> drivers/gpu/drm/i915/gt/selftest_reset.c | 14 +-
> drivers/gpu/drm/i915/gt/selftest_tlb.c | 5 +-
> .../gpu/drm/i915/gt/selftest_workarounds.c | 2 +-
> drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c | 8 +-
> drivers/gpu/drm/i915/i915_cache.c | 91 +++
> drivers/gpu/drm/i915/i915_cache.h | 60 +++
> drivers/gpu/drm/i915/i915_debugfs.c | 53 +-
> drivers/gpu/drm/i915/i915_driver.c | 5 +
> drivers/gpu/drm/i915/i915_drv.h | 5 +
> drivers/gpu/drm/i915/i915_gem.c | 21 +--
> drivers/gpu/drm/i915/i915_gpu_error.c | 7 +-
> drivers/gpu/drm/i915/i915_pci.c | 82 +-
> drivers/gpu/drm/i915/i915_perf.c | 2 +-
> drivers/gpu/drm/i915
[PATCH v3] drm/i915: Refactor PAT/object cache handling
From: Tvrtko Ursulin

Commit 9275277d5324 ("drm/i915: use pat_index instead of cache_level") has
introduced PAT indices to i915 internal APIs, partially replacing the
usage of driver internal cache_level, but has also added a few
questionable design decisions which this patch tries to improve upon.

The principal change is to invert the per-platform cache level to PAT index
table which was added by the referenced commit, and by doing so enable
i915 to understand the cache mode behind PAT indices, changing them from
opaque to transparent.

Once we have the inverted table we are able to remove the hidden false
"return true" from i915_gem_object_has_cache_level.

Other changes/fixes/improvements we are able to do:

1)
Replace the enum i915_cache_level with i915_cache_t, composed of a more
detailed representation of each cache mode (base mode plus flags).

For instance this way we are able to express the difference between WB and
1-way coherent WB on Meteorlake, which in turn enables us to map the i915
"cached" mode to the correct Meteorlake PAT index.

2)
We can cache PAT indices of the caching modes used by the driver itself in
struct drm_i915_private, which eliminates the runtime calls to
i915_gem_get_pat_index from both high- and low-level i915 components.

3)
We can also cache the caching modes used by the driver for coherent
access and for display buffers.

4)
Remove the incorrect references to enum i915_cache_level from low level
PTE encode vfuncs, since those are actually given PAT indices by their
callers.

5)
Because i915 now understands PAT indices, we can remove the overly
aggressive flushing triggered from i915_gem_object_can_bypass_llc() and
limit it to non-coherent write-back mode only.

6)
Finally we are able to replace the platform dependent cache mode to string
code in debugfs and elsewhere by the single implementation based on
i915_cache_t.

v2:
 * Fix PAT-to-cache-mode table for PVC. (Fei)
 * Cache display caching mode too. (Fei)
 * Improve and document criteria in i915_gem_object_can_bypass_llc() (Matt)

v3:
 * Checkpatch issues.
 * Cache mode flags check fixed.

Signed-off-by: Tvrtko Ursulin
Fixes: 9275277d5324 ("drm/i915: use pat_index instead of cache_level")
Cc: Chris Wilson
Cc: Fei Yang
Cc: Andi Shyti
Cc: Matt Roper
---
 drivers/gpu/drm/i915/Makefile | 1 +
 .../drm/i915/display/intel_plane_initial.c | 3 +-
 drivers/gpu/drm/i915/gem/i915_gem_domain.c | 56 ---
 drivers/gpu/drm/i915/gem/i915_gem_domain.h | 5 +-
 .../gpu/drm/i915/gem/i915_gem_execbuffer.c | 13 +-
 drivers/gpu/drm/i915/gem/i915_gem_internal.c | 4 +-
 drivers/gpu/drm/i915/gem/i915_gem_mman.c | 12 +-
 drivers/gpu/drm/i915/gem/i915_gem_object.c | 152 +++---
 drivers/gpu/drm/i915/gem/i915_gem_object.h | 11 +-
 .../gpu/drm/i915/gem/i915_gem_object_types.h | 116 +
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c | 8 +-
 drivers/gpu/drm/i915/gem/i915_gem_stolen.c | 11 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c | 44 ++---
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c | 2 +-
 .../drm/i915/gem/selftests/huge_gem_object.c | 4 +-
 .../gpu/drm/i915/gem/selftests/huge_pages.c | 6 +-
 drivers/gpu/drm/i915/gt/gen6_ppgtt.c | 4 +-
 drivers/gpu/drm/i915/gt/gen8_ppgtt.c | 19 +--
 drivers/gpu/drm/i915/gt/intel_engine_cs.c | 2 +-
 drivers/gpu/drm/i915/gt/intel_ggtt.c | 33 ++--
 drivers/gpu/drm/i915/gt/intel_ggtt_gmch.c | 4 +-
 drivers/gpu/drm/i915/gt/intel_gtt.c | 2 +-
 drivers/gpu/drm/i915/gt/intel_gtt.h | 3 +-
 drivers/gpu/drm/i915/gt/intel_migrate.c | 11 +-
 drivers/gpu/drm/i915/gt/intel_ppgtt.c | 6 +-
 .../gpu/drm/i915/gt/intel_ring_submission.c | 4 +-
 drivers/gpu/drm/i915/gt/intel_timeline.c | 2 +-
 drivers/gpu/drm/i915/gt/selftest_hangcheck.c | 2 +-
 drivers/gpu/drm/i915/gt/selftest_migrate.c | 9 +-
 drivers/gpu/drm/i915/gt/selftest_reset.c | 14 +-
 drivers/gpu/drm/i915/gt/selftest_tlb.c | 5 +-
 .../gpu/drm/i915/gt/selftest_workarounds.c | 2 +-
 drivers/gpu/drm/i915/gt/uc/intel_uc_fw.c | 8 +-
 drivers/gpu/drm/i915/i915_cache.c | 91 +++
 drivers/gpu/drm/i915/i915_cache.h | 60 +++
 drivers/gpu/drm/i915/i915_debugfs.c | 53 +-
 drivers/gpu/drm/i915/i915_driver.c | 5 +
 drivers/gpu/drm/i915/i915_drv.h | 5 +
 drivers/gpu/drm/i915/i915_gem.c | 21 +--
 drivers/gpu/drm/i915/i915_gpu_error.c | 7 +-
 drivers/gpu/drm/i915/i915_pci.c | 82 +-
 drivers/gpu/drm/i915/i915_perf.c | 2 +-
 drivers/gpu/drm/i915/intel_device_info.h | 6 +-
 drivers/gpu/drm/i915/selftests/i915_gem.c | 5 +-
 .../gpu/drm/i915/selftests/i915_gem_evict.c | 8 +-
 drivers/gpu/drm/i915/selftests/i915_gem_gtt.c | 13 +-
 drivers/gpu/drm/i915/selftests/igt_spinner.c | 2 +-
 .../drm/i915/selftes