Re: [Intel-gfx] [PATCH v5] drm/i915: Add IOCTL Param to control data port coherency.

2018-07-13 Thread Lis, Tomasz



Re: [Intel-gfx] [PATCH v5] drm/i915: Add IOCTL Param to control data port coherency.

2018-07-13 Thread Tvrtko Ursulin


[Intel-gfx] [PATCH v5] drm/i915: Add IOCTL Param to control data port coherency.

2018-07-12 Thread Tomasz Lis
The patch adds a parameter to control the data port coherency functionality
on a per-context level. When the IOCTL is called, a command to switch the data
port coherency state is added to the ordered list. All prior requests are
executed with the old coherency setting, and all exec requests after the IOCTL
will use the new setting.

Rationale:

The OpenCL driver developers requested functionality to control cache
coherency at the data port level. Keeping coherency at that level is disabled
by default due to its performance cost. The OpenCL driver plans to
enable it for a small subset of submissions, when such functionality is
required. Below are answers to basic questions explaining the background
of the functionality and the reasoning for the proposed implementation:

1. Why do we need a coherency enable/disable switch for memory that is shared
between CPU and GEN (GPU)?

Memory coherency between CPU and GEN, while being a great feature that enables
CL_MEM_SVM_FINE_GRAIN_BUFFER OCL capability on Intel GEN architecture, adds
overhead related to tracking (snooping) memory inside different cache units
(L1$, L2$, L3$, LLC$, etc.). At the same time, only a minority of modern OCL
applications actually use CL_MEM_SVM_FINE_GRAIN_BUFFER (and hence require
memory coherency between CPU and GPU). The goal of the coherency enable/disable
switch is to remove the overhead of memory coherency when it is not
needed.

2. Why do we need a global coherency switch?

In order to support I/O commands from within EUs (Execution Units), Intel GEN
ISA (GEN Instruction Set Assembly) contains dedicated "send" instructions.
These send instructions provide several addressing models. One of these
addressing models (named "stateless") provides the most flexible I/O, using
plain virtual addresses (as opposed to buffer_handle+offset models). This "stateless"
model is similar to regular memory load/store operations available on typical
CPUs. Since this model provides I/O using arbitrary virtual addresses, it
enables algorithmic designs that are based on pointer-to-pointer (e.g. buffer
of pointers) concepts. For instance, it allows creating tree-like data
structures such as:
   
   +----------------+
   |     NODE1      |
   | uint64_t data  |
   +----------------+
   | NODE*  | NODE* |
   +--------+-------+
        /        \
       /          \
   +----------------+   +----------------+
   |     NODE2      |   |     NODE3      |
   | uint64_t data  |   | uint64_t data  |
   +----------------+   +----------------+
   | NODE*  | NODE* |   | NODE*  | NODE* |
   +--------+-------+   +--------+-------+

Please note that pointers inside such structures can point to memory locations
in different OCL allocations - e.g. NODE1 and NODE2 can reside in one OCL
allocation while NODE3 resides in a completely separate OCL allocation.
Additionally, such pointers can be shared with the CPU (i.e. using the SVM -
Shared Virtual Memory - feature). Using pointers from different allocations
doesn't affect the stateless addressing model, which even allows scattered
reads from different allocations at the same time (i.e. by utilizing the SIMD
nature of send instructions).

When it comes to coherency programming, send instructions in the stateless model
can be encoded (at the ISA level) to either use or disable coherency. However, for
generic OCL applications (such as the example with the tree-like data structure),
the OCL compiler is not able to determine the origin of memory pointed to by an
arbitrary pointer - i.e. it is not able to track a given pointer back to a specific
allocation. As such, it is not able to decide whether coherency is needed for a
specific pointer (or for a specific I/O instruction). As a result, the compiler
encodes all stateless sends as coherent (doing otherwise would lead to
functional issues resulting from data corruption). Please note that it would be
possible to work around this (e.g. based on an allocations map and pointer bounds
checking prior to each I/O instruction), but the performance cost of such a
workaround would be many times greater than the cost of keeping coherency
always enabled. As such, enabling/disabling memory coherency at the GEN ISA level
is not feasible and an alternative method is needed.

The alternative solution is to have a global coherency switch that allows
disabling coherency for a single (though entire) GPU submission. This is
beneficial because this way we:
* can enable (and pay for) coherency only in submissions that actually need
coherency (submissions that use CL_MEM_SVM_FINE_GRAIN_BUFFER resources)
* don't care about coherency at GEN ISA granularity (no performance impact)

3. Will coherency switch be used frequently?

There are scenarios that will require frequent toggling of the coherency
switch.
E.g. an application has two OCL compute kernels: kern_master and kern_worker.
kern_master uses, concurrently with CPU, some fine