Re: [Intel-gfx] [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86

2014-12-15 Thread Ben Widawsky
On Sun, Dec 14, 2014 at 08:06:20PM -0800, Jesse Barnes wrote:
 On 12/14/2014 4:59 AM, Chris Wilson wrote:
 One of the things wbinvd is considered evil for is that it blocks the
 CPU for an indeterminate amount of time - upsetting latency critical
 aspects of the OS. For example, the x86/mm has similar code to use
 wbinvd for large clflushes that caused a bit of controversy with RT:
 
 http://linux-kernel.2935.n7.nabble.com/PATCH-x86-Use-clflush-instead-of-wbinvd-whenever-possible-when-changing-mapping-td493751.html
 
 and the use of wbinvd in the nvidia driver has also been noted as
 evil by RT folks.

 However, as wbinvd still exists, it can't be all that bad...

That patch looks eerily similar. I guess mine is slightly better in that it
takes the cache size into account.

 
 Yeah there are definitely tradeoffs here.  In this particular case, we're
 trying to flush out a ~140M object on every frame, which just seems silly.
 
 There's definitely room for optimization in Mesa too; avoiding a mapping
 that marks a large bo as dirty would be good, but if we improve the kernel
 in this area, even sloppy apps and existing binaries will speed up.
 
 Maybe we could apply this only on !llc systems or something?  I wonder how
 much wbinvd performance varies across microarchitectures; maybe Thomas's
 issue isn't really relevant anymore (at least one can hope).

I am pretty sure that, from the Mesa perspective, we do not hit this path on
non-LLC systems, because we end up with a linear buffer and just CPU map it.
In addition to trying to manage the cache dirtiness ourselves, the current
code which determines how to manage subregions within large textures could
probably use some eyes, to make sure the order in which we decide the various
fallbacks makes sense for all sizes and all platforms. I /think/ we can do
better, but when I was writing a patch, the code got messy fast.

 
 When digging into this, I found that an optimization to remove the IPI for
 wbinvd was clobbered during a merge; maybe that should be resurrected too.
 Surely a single, global wbinvd is sufficient; we don't need to do n_cpus^2
 wbinvd + the associated invalidation bus signals here...

If this actually works, then there should be no CPU stall at all.

 
 Alternately, we could insert some delays into this path just to make it
 extra clear to userspace that they really shouldn't be hitting this in the
 common case (and provide some additional interfaces to let them avoid it by
 allowing flushing and dirty management in userspace).

I don't think such an extreme step is needed. If we really don't need the IPI,
then I think this path can only be faster than CLFLUSH.
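
For reference, here is a minimal sketch (not the patch under discussion) of
the two fallback shapes being compared. wbinvd_on_all_cpus() is the existing
kernel helper that IPIs every online CPU; the bare wbinvd() variant is the
hypothetical optimization, and assumes - unverified - that a single CPU's
write-back-and-invalidate suffices on the platform in question:

#include <asm/smp.h>            /* wbinvd_on_all_cpus() */
#include <asm/special_insns.h>  /* wbinvd() */

static void drm_cache_wbinvd_sketch(bool ipi_all_cpus)
{
        if (ipi_all_cpus) {
                /* Today's fallback: IPI every online CPU and have each one
                 * execute wbinvd, stalling all of them meanwhile. */
                if (wbinvd_on_all_cpus())
                        printk(KERN_ERR "Timed out waiting for cache flush.\n");
        } else {
                /* The optimization discussed above: one local
                 * write-back-and-invalidate, no IPIs, no cross-CPU stall. */
                wbinvd();
        }
}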

 
 Jesse


Re: [Intel-gfx] [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86

2014-12-15 Thread Ben Widawsky
On Sat, Dec 13, 2014 at 08:15:22PM -0800, Matt Turner wrote:
 On Sat, Dec 13, 2014 at 7:08 PM, Ben Widawsky
 <benjamin.widaw...@intel.com> wrote:
  Any GEM driver which has very large objects and a slow CPU is subject to
  very long waits simply for clflushing incoherent objects. Generally, each
  individual object is not a problem, but if you have very large objects, or
  very many objects, the flushing begins to show up in profiles. Because on
  x86 we know the cache size, we can easily determine when an object will
  use all the cache, and forego iterating over each cacheline.

  We need to be careful when using wbinvd. wbinvd() is itself potentially
  slow because it requires synchronizing the flush across all CPUs so they
  have a coherent view of memory. This can result in either stalling work
  being done on other CPUs, or this call itself stalling while waiting for
  a CPU to accept the interrupt. wbinvd() also has the downside of
  invalidating all cachelines, so we don't want to use it unless we're sure
  we already own most of the cachelines.

  The current algorithm is very naive. I think it can be tweaked more, and
  it would be good if someone else gave it some thought. I am pretty
  confident that in i915 we can even skip the IPI in the execbuf path with
  minimal code change (or perhaps just some verifying of the existing
  code). It would be nice to hear what other developers who depend on this
  code think.
 
  Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
  Signed-off-by: Ben Widawsky <b...@bwidawsk.net>
  ---
   drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
   1 file changed, 17 insertions(+), 3 deletions(-)
 
  diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
  index d7797e8..6009c2d 100644
  --- a/drivers/gpu/drm/drm_cache.c
  +++ b/drivers/gpu/drm/drm_cache.c
  @@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
                    drm_clflush_page(*pages++);
            mb();
   }
  +
  +static bool
  +drm_cache_should_clflush(unsigned long num_pages)
  +{
  +        const int cache_size = boot_cpu_data.x86_cache_size;
  +
  +        /* For now the algorithm simply checks if the number of pages to be
  +         * flushed is greater than the entire system cache. One could make the
  +         * function more aware of the actual system (ie. if SMP, how large is
  +         * the cache, CPU freq. etc. All those help to determine when to
  +         * wbinvd() */
  +        WARN_ON_ONCE(!cache_size);
  +        return !cache_size || num_pages < (cache_size >> 2);
  +}
   #endif
 
   void
  @@ -71,7 +85,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
   {

   #if defined(CONFIG_X86)
  -        if (cpu_has_clflush) {
  +        if (cpu_has_clflush && drm_cache_should_clflush(num_pages)) {
                    drm_cache_flush_clflush(pages, num_pages);
                    return;
            }
  @@ -104,7 +118,7 @@ void
   drm_clflush_sg(struct sg_table *st)
   {
   #if defined(CONFIG_X86)
  -        if (cpu_has_clflush) {
  +        if (cpu_has_clflush && drm_cache_should_clflush(st->nents)) {
                    struct sg_page_iter sg_iter;

                    mb();
  @@ -128,7 +142,7 @@ void
   drm_clflush_virt_range(void *addr, unsigned long length)
   {
   #if defined(CONFIG_X86)
  -        if (cpu_has_clflush) {
  +        if (cpu_has_clflush && drm_cache_should_clflush(length / PAGE_SIZE)) {
 
 If length isn't a multiple of page size, isn't this ignoring the
 remainder? Should it be rounding length up to the next multiple of
 PAGE_SIZE, like ROUND_UP_TO?

Yeah, we could round_up. In practice it probably won't matter. I actually think
it would be better to pass a size to drm_cache_should_clflush(), and let that
round it up. It sounds like people don't want this patch anyway, so I'll make
the equivalent change in the i915-only patch.
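
For what it's worth, a sketch of that byte-based variant (the function name
is hypothetical; DIV_ROUND_UP() is the stock kernel macro, so a partial
trailing page still counts against the threshold):

#include <linux/kernel.h>       /* DIV_ROUND_UP() */

static bool
drm_cache_should_clflush_bytes(unsigned long length)
{
        const int cache_size = boot_cpu_data.x86_cache_size; /* in KB */
        unsigned long num_pages = DIV_ROUND_UP(length, PAGE_SIZE);

        WARN_ON_ONCE(!cache_size);
        return !cache_size || num_pages < (cache_size >> 2);
}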

-- 
Ben Widawsky, Intel Open Source Technology Center


Re: [Intel-gfx] [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86

2014-12-14 Thread Chris Wilson
On Sat, Dec 13, 2014 at 07:08:22PM -0800, Ben Widawsky wrote:
 Any GEM driver which has very large objects and a slow CPU is subject to
 very long waits simply for clflushing incoherent objects. Generally, each
 individual object is not a problem, but if you have very large objects, or
 very many objects, the flushing begins to show up in profiles. Because on
 x86 we know the cache size, we can easily determine when an object will
 use all the cache, and forego iterating over each cacheline.

 We need to be careful when using wbinvd. wbinvd() is itself potentially
 slow because it requires synchronizing the flush across all CPUs so they
 have a coherent view of memory. This can result in either stalling work
 being done on other CPUs, or this call itself stalling while waiting for
 a CPU to accept the interrupt. wbinvd() also has the downside of
 invalidating all cachelines, so we don't want to use it unless we're sure
 we already own most of the cachelines.

 The current algorithm is very naive. I think it can be tweaked more, and
 it would be good if someone else gave it some thought. I am pretty
 confident that in i915 we can even skip the IPI in the execbuf path with
 minimal code change (or perhaps just some verifying of the existing
 code). It would be nice to hear what other developers who depend on this
 code think.

One of the things wbinvd is considered evil for is that it blocks the
CPU for an indeterminate amount of time - upsetting latency critical
aspects of the OS. For example, the x86/mm has similar code to use
wbinvd for large clflushes that caused a bit of controversy with RT:

http://linux-kernel.2935.n7.nabble.com/PATCH-x86-Use-clflush-instead-of-wbinvd-whenever-possible-when-changing-mapping-td493751.html

and the use of wbinvd in the nvidia driver has also been noted as
evil by RT folks.

However, as wbinvd still exists, it can't be all that bad...

 Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
 Signed-off-by: Ben Widawsky <b...@bwidawsk.net>
 ---
  drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
  1 file changed, 17 insertions(+), 3 deletions(-)
 
 diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
 index d7797e8..6009c2d 100644
 --- a/drivers/gpu/drm/drm_cache.c
 +++ b/drivers/gpu/drm/drm_cache.c
 @@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
                   drm_clflush_page(*pages++);
           mb();
  }
 +
 +static bool
 +drm_cache_should_clflush(unsigned long num_pages)
 +{
 +        const int cache_size = boot_cpu_data.x86_cache_size;

Maybe const int cpu_cache_size = boot_cpu_data.x86_cache_size >> 2; /* in pages */

Just to make it clear where the factor of 4 is required.
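
Spelled out, that suggestion would look something like this (a sketch only;
x86_cache_size is reported in KB and pages are 4KB, hence the shift by two):

static bool
drm_cache_should_clflush(unsigned long num_pages)
{
        /* x86_cache_size is in KB; a page is 4KB, so >> 2 converts
         * KB to pages - this is where the factor of 4 comes from. */
        const int cpu_cache_size = boot_cpu_data.x86_cache_size >> 2; /* in pages */

        WARN_ON_ONCE(!cpu_cache_size);
        return !cpu_cache_size || num_pages < cpu_cache_size;
}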

How standalone is this patch? What happens if you just apply it by itself?
I presume it wasn't all that effective, since you needed the additional
patches to prevent superfluous flushes. It should still reduce the time it
takes to bind framebuffers, etc., but I expect it to worsen overall
performance as we do the repeated wbinvd in execbuffer.
-Chris

-- 
Chris Wilson, Intel Open Source Technology Centre


Re: [Intel-gfx] [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86

2014-12-14 Thread Jesse Barnes

On 12/14/2014 4:59 AM, Chris Wilson wrote:

 One of the things wbinvd is considered evil for is that it blocks the
 CPU for an indeterminate amount of time - upsetting latency critical
 aspects of the OS. For example, the x86/mm has similar code to use
 wbinvd for large clflushes that caused a bit of controversy with RT:

 http://linux-kernel.2935.n7.nabble.com/PATCH-x86-Use-clflush-instead-of-wbinvd-whenever-possible-when-changing-mapping-td493751.html

 and the use of wbinvd in the nvidia driver has also been noted as
 evil by RT folks.

 However, as wbinvd still exists, it can't be all that bad...


Yeah there are definitely tradeoffs here.  In this particular case, 
we're trying to flush out a ~140M object on every frame, which just 
seems silly.


There's definitely room for optimization in Mesa too; avoiding a mapping 
that marks a large bo as dirty would be good, but if we improve the 
kernel in this area, even sloppy apps and existing binaries will speed up.


Maybe we could apply this only on !llc systems or something?  I wonder 
how much wbinvd performance varies across microarchitectures; maybe 
Thomas's issue isn't really relevant anymore (at least one can hope).


When digging into this, I found that an optimization to remove the IPI 
for wbinvd was clobbered during a merge; maybe that should be 
resurrected too.  Surely a single, global wbinvd is sufficient; we don't 
need to do n_cpus^2 wbinvd + the associated invalidation bus signals here...


Alternately, we could insert some delays into this path just to make it 
extra clear to userspace that they really shouldn't be hitting this in 
the common case (and provide some additional interfaces to let them 
avoid it by allowing flushing and dirty management in userspace).


Jesse


[Intel-gfx] [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86

2014-12-13 Thread Ben Widawsky
Any GEM driver which has very large objects and a slow CPU is subject to very
long waits simply for clflushing incoherent objects. Generally, each individual
object is not a problem, but if you have very large objects, or very many
objects, the flushing begins to show up in profiles. Because on x86 we know the
cache size, we can easily determine when an object will use all the cache, and
forego iterating over each cacheline.

We need to be careful when using wbinvd. wbinvd() is itself potentially slow
because it requires synchronizing the flush across all CPUs so they have a
coherent view of memory. This can result in either stalling work being done on
other CPUs, or this call itself stalling while waiting for a CPU to accept the
interrupt. wbinvd() also has the downside of invalidating all cachelines,
so we don't want to use it unless we're sure we already own most of the
cachelines.

The current algorithm is very naive. I think it can be tweaked more, and it
would be good if someone else gave it some thought. I am pretty confident that
in i915 we can even skip the IPI in the execbuf path with minimal code change (or
perhaps just some verifying of the existing code). It would be nice to hear what
other developers who depend on this code think.

Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
Signed-off-by: Ben Widawsky <b...@bwidawsk.net>
---
 drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
index d7797e8..6009c2d 100644
--- a/drivers/gpu/drm/drm_cache.c
+++ b/drivers/gpu/drm/drm_cache.c
@@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
                 drm_clflush_page(*pages++);
         mb();
 }
+
+static bool
+drm_cache_should_clflush(unsigned long num_pages)
+{
+        const int cache_size = boot_cpu_data.x86_cache_size;
+
+        /* For now the algorithm simply checks if the number of pages to be
+         * flushed is greater than the entire system cache. One could make the
+         * function more aware of the actual system (ie. if SMP, how large is
+         * the cache, CPU freq. etc. All those help to determine when to
+         * wbinvd() */
+        WARN_ON_ONCE(!cache_size);
+        return !cache_size || num_pages < (cache_size >> 2);
+}
 #endif
 
 void
@@ -71,7 +85,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
 {

 #if defined(CONFIG_X86)
-        if (cpu_has_clflush) {
+        if (cpu_has_clflush && drm_cache_should_clflush(num_pages)) {
                 drm_cache_flush_clflush(pages, num_pages);
                 return;
         }
@@ -104,7 +118,7 @@ void
 drm_clflush_sg(struct sg_table *st)
 {
 #if defined(CONFIG_X86)
-        if (cpu_has_clflush) {
+        if (cpu_has_clflush && drm_cache_should_clflush(st->nents)) {
                 struct sg_page_iter sg_iter;

                 mb();
@@ -128,7 +142,7 @@ void
 drm_clflush_virt_range(void *addr, unsigned long length)
 {
 #if defined(CONFIG_X86)
-        if (cpu_has_clflush) {
+        if (cpu_has_clflush && drm_cache_should_clflush(length / PAGE_SIZE)) {
                 void *end = addr + length;
                 mb();
                 for (; addr < end; addr += boot_cpu_data.x86_clflush_size)
-- 
2.1.3
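
To make the heuristic concrete, a worked example with assumed (not measured)
numbers:

/* Assume boot_cpu_data.x86_cache_size = 8192, i.e. an 8MB LLC (the field
 * is reported in KB).  Then the clflush/wbinvd threshold is:
 *
 *     8192 >> 2 = 2048 pages  (2048 * 4KB = 8MB)
 *
 * so a 16KB batch buffer (4 pages) takes the clflush loop, while the
 * ~140MB object mentioned elsewhere in the thread (~35840 pages) takes
 * the wbinvd fallback.
 */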



Re: [Intel-gfx] [PATCH 2/4] drm/cache: Try to be smarter about clflushing on x86

2014-12-13 Thread Matt Turner
On Sat, Dec 13, 2014 at 7:08 PM, Ben Widawsky
<benjamin.widaw...@intel.com> wrote:
 Any GEM driver which has very large objects and a slow CPU is subject to
 very long waits simply for clflushing incoherent objects. Generally, each
 individual object is not a problem, but if you have very large objects, or
 very many objects, the flushing begins to show up in profiles. Because on
 x86 we know the cache size, we can easily determine when an object will
 use all the cache, and forego iterating over each cacheline.

 We need to be careful when using wbinvd. wbinvd() is itself potentially
 slow because it requires synchronizing the flush across all CPUs so they
 have a coherent view of memory. This can result in either stalling work
 being done on other CPUs, or this call itself stalling while waiting for
 a CPU to accept the interrupt. wbinvd() also has the downside of
 invalidating all cachelines, so we don't want to use it unless we're sure
 we already own most of the cachelines.

 The current algorithm is very naive. I think it can be tweaked more, and
 it would be good if someone else gave it some thought. I am pretty
 confident that in i915 we can even skip the IPI in the execbuf path with
 minimal code change (or perhaps just some verifying of the existing
 code). It would be nice to hear what other developers who depend on this
 code think.

 Cc: Intel GFX <intel-gfx@lists.freedesktop.org>
 Signed-off-by: Ben Widawsky <b...@bwidawsk.net>
 ---
  drivers/gpu/drm/drm_cache.c | 20 +++++++++++++++++---
  1 file changed, 17 insertions(+), 3 deletions(-)

 diff --git a/drivers/gpu/drm/drm_cache.c b/drivers/gpu/drm/drm_cache.c
 index d7797e8..6009c2d 100644
 --- a/drivers/gpu/drm/drm_cache.c
 +++ b/drivers/gpu/drm/drm_cache.c
 @@ -64,6 +64,20 @@ static void drm_cache_flush_clflush(struct page *pages[],
                  drm_clflush_page(*pages++);
          mb();
  }
 +
 +static bool
 +drm_cache_should_clflush(unsigned long num_pages)
 +{
 +        const int cache_size = boot_cpu_data.x86_cache_size;
 +
 +        /* For now the algorithm simply checks if the number of pages to be
 +         * flushed is greater than the entire system cache. One could make the
 +         * function more aware of the actual system (ie. if SMP, how large is
 +         * the cache, CPU freq. etc. All those help to determine when to
 +         * wbinvd() */
 +        WARN_ON_ONCE(!cache_size);
 +        return !cache_size || num_pages < (cache_size >> 2);
 +}
  #endif

  void
 @@ -71,7 +85,7 @@ drm_clflush_pages(struct page *pages[], unsigned long num_pages)
  {

  #if defined(CONFIG_X86)
 -        if (cpu_has_clflush) {
 +        if (cpu_has_clflush && drm_cache_should_clflush(num_pages)) {
                  drm_cache_flush_clflush(pages, num_pages);
                  return;
          }
 @@ -104,7 +118,7 @@ void
  drm_clflush_sg(struct sg_table *st)
  {
  #if defined(CONFIG_X86)
 -        if (cpu_has_clflush) {
 +        if (cpu_has_clflush && drm_cache_should_clflush(st->nents)) {
                  struct sg_page_iter sg_iter;

                  mb();
 @@ -128,7 +142,7 @@ void
  drm_clflush_virt_range(void *addr, unsigned long length)
  {
  #if defined(CONFIG_X86)
 -        if (cpu_has_clflush) {
 +        if (cpu_has_clflush && drm_cache_should_clflush(length / PAGE_SIZE)) {

If length isn't a multiple of page size, isn't this ignoring the
remainder? Should it be rounding length up to the next multiple of
PAGE_SIZE, like ROUND_UP_TO?