Hi Ralf,
Good to see some merging of efforts.
On 02/01/2013 10:22 AM, Ralf Karrenberg wrote:
> For you on the other hand, I guess the benefit should not require a lot
> of explanation, so I hope to get some help on where to start :).
Lately I have been thinking about how to implement *performance
portable* work group parallelization (including vectorization) support
with a modularized approach (using passes that are as generic as possible).
Unfortunately, as this is not currently my main priority (in the research
job at TUT), I'm not sure how fast this will progress. Therefore,
I'm interested in seeing how much of the WFV work can help us move in
this direction as well.
AFAIU, from pocl's point of view, WFV is a method for generating multi-WI
work group functions from the single work-item kernel bitcode
produced by Clang.
Currently, there are two main methods for producing the multi-WI work group
functions in pocl: replication and loops.
The 'loops' method generates parallel loops over the local size, thus
iterating across all the work items. I've begun moving this part towards
serving as the basis for a loop-vectorizer-based implementation.
The replication method chains all the work items, so it is analogous to
fully unrolling these loops. It was originally written for
our VLIW-like target (TTA), where unrolling is desirable, before we
decided to open source the OpenCL work. These original passes and
the basis for pocl were written by Carlos Sánchez de La Lama during
our years of TTA collaboration.
The loop-generation method was written by me last year, mainly to reduce
the program footprint for larger local sizes, and also with loop-based
vectorization in mind.
These two methods can be merged at some point, as fully unrolling the loops
should produce the same effect as replicating the work items in the first place.
Currently the main reason they are separate is that we (the TCE team
at TUT) rely on the replication method (especially its parallelism
metadata) in the compilation chain of our research processors.
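To make the difference concrete, here's a rough hand-written sketch (not
actual pocl output; the 1-D local size of 4, the argument names and the
trivial kernel body are made up just for illustration):

  // Conceptual single work-item kernel body, as Clang sees it:
  //   out[gid] = in[gid] * 2;

  // 'loops': one work-item loop iterating over the local size.
  void wg_loops(const int *in, int *out, int group_id, int local_size) {
    for (int lid = 0; lid < local_size; ++lid) {  // iterate all work items
      int gid = group_id * local_size + lid;
      out[gid] = in[gid] * 2;
    }
  }

  // 'repl': the body is replicated and chained once per work item, which is
  // what fully unrolling the loop above would also produce (local size 4).
  void wg_repl(const int *in, int *out, int group_id) {
    { int gid = group_id * 4 + 0; out[gid] = in[gid] * 2; }
    { int gid = group_id * 4 + 1; out[gid] = in[gid] * 2; }
    { int gid = group_id * 4 + 2; out[gid] = in[gid] * 2; }
    { int gid = group_id * 4 + 3; out[gid] = in[gid] * 2; }
  }
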
In docs/env.txt you can see that there's an environment variable
POCL_WORK_GROUP_METHOD which can be used to select the work group
generation method:
"The kernel compiler method to produce the work group functions from
multiple work items. Legal values:
auto -- Choose the best available method depending on the
kernel and the work group size (default). Use
POCL_FULL_REPLICATION_THRESHOLD=N to set the
maximum local size for a work group to be
replicated fully with 'repl'. Otherwise,
'loops' is used.
loops -- Create for-loops that execute the work items
(under stabilization). The drawback is the
need to save the thread contexts in arrays.
The loops will be unrolled a certain number of
times, the maximum of which can be controlled with the
POCL_WILOOPS_MAX_UNROLL_COUNT=N environment
variable (the default is to not unroll).
loopvec -- Create work-item for-loops (see 'loops') and execute
the LLVM LoopVectorizer. The loops are not unrolled
but the unrolling decision is left to the generic
LLVM passes.
repl -- Replicate and chain all work items. This results
in more easily scalarizable private variables.
However, the code bloat is increased with larger
local sizes."
The code base is organized such that /scripts contains shell scripts
used to drive the different kernel compilation phases.
We wish to get rid of these scripts at some point, as they hurt
portability, among other things. Kalle Raiskila has been working on a version
that calls the LLVM APIs directly from the host library instead of
going through the scripts. The kernel library (mostly contributed
by Erik Schnetter) is located in lib/kernel, where targets
can override the defaults (including get_local_id etc.) by placing
their own versions in target subdirectories.
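As a purely hypothetical example (this is not pocl's actual builtin
library code, and the variable name is made up), a target could drop
something like this under its lib/kernel subdirectory to override the
generic get_local_id:

  // Illustrative only: assume the target's work-group launcher stores the
  // current local IDs in a global the kernel compiler knows about.
  // 'unsigned long' stands in for the OpenCL size_t here.
  unsigned long _my_target_local_id[3];  // set by the launcher (made-up name)

  unsigned long get_local_id(unsigned dim) {
    return dim < 3 ? _my_target_local_id[dim] : 0;
  }
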
Anyway, pocl-workgroup is the script of interest here. At the end of
it you can see the list of opt passes it currently executes to produce the
work group functions. The passes themselves are located in
lib/llvmopencl. The main complexity is in the parallel region formation,
which detects the regions between barriers (Kernel::getParallelRegions());
these regions can then be replicated, looped, or handled in some other way
(even a thread-based implementation could be done if that seems to
be the best option for the target at hand).
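As a rough hand-written illustration (again, not actual pocl output), a
single barrier splits a kernel into two parallel regions; with 'loops'
each region gets its own work-item loop, and any value that is live
across the barrier has to be kept per work item in a context array:

  enum { MAX_WORK_ITEMS = 1024 };  // made-up bound for the sketch

  // Conceptual single work-item kernel:
  //   tmp = in[lid];
  //   barrier(...);
  //   out[lid] = tmp + in[(lid + 1) % local_size];

  void wg_two_regions(const int *in, int *out, int local_size) {
    int tmp[MAX_WORK_ITEMS];                    // per-work-item context

    for (int lid = 0; lid < local_size; ++lid)  // region before the barrier
      tmp[lid] = in[lid];

    // The barrier itself vanishes: finishing the first loop guarantees every
    // work item has reached it before any enters the second region.

    for (int lid = 0; lid < local_size; ++lid)  // region after the barrier
      out[lid] = tmp[lid] + in[(lid + 1) % local_size];
  }
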
'repl' is implemented in WorkitemReplication.cc and 'loops'
in WorkitemLoops. 'loopvec' uses WorkitemLoops to generate
the loops (which are now annotated with the loop parallelism metadata
I've been trying to upstream to LLVM lately) and then calls
LLVM's inner loop vectorizer (from pocl-workgroup).
However, as I understand it, your WFV is a complete work group generation
solution that also detects the parallel regions during the process,
so it can skip most of the default pocl-workgroup optimization
pass list and replace it with its own (unless you want to modularize and
share code with the other methods). So maybe the best way
is to define a new method, e.g. 'wfv'. Then, in pocl-workgroup,
if 'wfv' is selected, you can call your own optimization passes
or, e.g., a pass that wraps calls to your library.
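As a starting point, such a wrapper could be a plain FunctionPass along
these lines (a minimal sketch only: the wfv:: entry point is hypothetical,
and the header paths follow the old pre-3.3 include layout, so adjust to
your LLVM version):

  #include "llvm/Pass.h"
  #include "llvm/Function.h"   // "llvm/IR/Function.h" in newer LLVM trees

  namespace {
    // Runs on each kernel function and hands it to the external WFV library,
    // which would detect the parallel regions and emit the multi-WI function.
    struct WFVWorkGroupGen : public llvm::FunctionPass {
      static char ID;
      WFVWorkGroupGen() : llvm::FunctionPass(ID) {}

      virtual bool runOnFunction(llvm::Function &F) {
        // Hypothetical library entry point, e.g.:
        //   return wfv::generateWorkGroupFunction(F, LocalSizes);
        (void)F;
        return false;  // placeholder so the sketch compiles stand-alone
      }
    };
  }

  char WFVWorkGroupGen::ID = 0;
  static llvm::RegisterPass<WFVWorkGroupGen>
      X("wfv", "Work group generation using whole-function vectorization");
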
Hopefully this will get you started. I'll be happy to answer any
further questions, here on the list or in #pocl.
In practical terms, you could push a branch to https://code.launchpad.net/pocl
which we can then merge to trunk after reaching some level of
stability.
The test suite executed with 'make check' contains regression tests
and can also run some external OpenCL projects automatically.
Carlos, I'd be glad to hear your thoughts on all of this.
BR,
--
--Pekka