Nice work! I sent some comments on patches 6 & 7. Patches 1-4 and 7-15 are
Reviewed-by: Nicolai Hähnle <nicolai.haeh...@amd.com>
assuming you checked them with a Piglit run in addition to the shader-db.
Something that LLVM does for its intermediate representations is using
Recycler objects. Instructions are allocated from a linear allocator,
but when they are removed they are neither returned to the heap nor
simply forgotten. Instead, the memory block is added to a linked list
managed by the Recycler object, so that the next instruction allocation
can be served from there.
I suspect that this could also help here because it's still very fast
but keeps the cache footprint smaller.
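
To make the idea concrete, here is a minimal sketch of such a recycler in C.
It is not LLVM's Recycler API and not code from this series: every name
(sketch_recycler, sketch_free_node, ...) is hypothetical, and it assumes
fixed-size blocks carved out of a caller-provided arena.

```c
#include <assert.h>
#include <stddef.h>

/* A released block is reused as a free-list node (hypothetical layout). */
struct sketch_free_node {
   struct sketch_free_node *next;
};

struct sketch_recycler {
   struct sketch_free_node *free_list; /* released blocks awaiting reuse */
   unsigned char *arena;               /* backing linear storage */
   size_t arena_size;
   size_t offset;                      /* bump pointer into the arena */
   size_t block_size;                  /* rounded-up size of every block */
};

static void
sketch_recycler_init(struct sketch_recycler *r, void *arena,
                     size_t arena_size, size_t block_size)
{
   const size_t align = sizeof(void *);

   assert(r != NULL && arena != NULL);
   /* A block must be able to hold a free-list node once it is released. */
   if (block_size < sizeof(struct sketch_free_node))
      block_size = sizeof(struct sketch_free_node);
   block_size = (block_size + align - 1) / align * align;

   r->free_list = NULL;
   r->arena = arena;
   r->arena_size = arena_size;
   r->offset = 0;
   r->block_size = block_size;
}

static void *
sketch_recycler_alloc(struct sketch_recycler *r)
{
   /* Prefer a recycled block: it is likely still warm in the cache. */
   if (r->free_list) {
      struct sketch_free_node *n = r->free_list;
      r->free_list = n->next;
      return n;
   }
   /* Otherwise fall back to linear (bump) allocation. */
   if (r->offset + r->block_size > r->arena_size)
      return NULL;
   void *p = r->arena + r->offset;
   r->offset += r->block_size;
   return p;
}

static void
sketch_recycler_release(struct sketch_recycler *r, void *p)
{
   /* Neither freed nor forgotten: push the block onto the free list. */
   struct sketch_free_node *n = p;
   n->next = r->free_list;
   r->free_list = n;
}
```

Releasing a block and allocating again hands back the same address without
advancing the bump pointer, which is where the smaller footprint would come
from.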
On 08.10.2016 12:58, Marek Olšák wrote:
This patch series reduces the number of malloc calls in the GLSL
compiler by 63%. That leads to better compile times and less heap
It's done by switching memory allocations in the GLSL compiler to my
new linear allocator that allocates out of a fixed-sized buffer with
a monotonically increasing offset. If more buffers are needed, it
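
The scheme described above can be sketched roughly as follows. This is an
illustration only, not the code added to src/util/ralloc.c: all names, the
2048-byte buffer size, and the 8-byte alignment are hypothetical, and
chaining a fresh buffer when the current one is full is an assumption made
for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_BUF_SIZE 2048u /* hypothetical buffer size */
#define SKETCH_ALIGN    8u    /* hypothetical allocation alignment */

struct sketch_buf {
   struct sketch_buf *prev; /* previously filled buffer, if any */
   size_t offset;           /* monotonically increasing offset */
   size_t capacity;         /* usable bytes in data[] */
   max_align_t data[];      /* payload, suitably aligned */
};

struct sketch_linear {
   struct sketch_buf *cur;  /* buffer currently being filled */
   unsigned malloc_calls;   /* how often malloc() was really called */
};

static void *
sketch_linear_alloc(struct sketch_linear *ctx, size_t size)
{
   assert(ctx != NULL);
   /* Round the request up so every returned pointer stays aligned. */
   size = (size + SKETCH_ALIGN - 1) & ~(size_t)(SKETCH_ALIGN - 1);

   struct sketch_buf *b = ctx->cur;
   if (!b || b->offset + size > b->capacity) {
      /* Assumed behavior: start a new buffer and chain the old one. */
      size_t cap = size > SKETCH_BUF_SIZE ? size : SKETCH_BUF_SIZE;
      b = malloc(sizeof(*b) + cap);
      if (!b)
         return NULL;
      b->prev = ctx->cur;
      b->offset = 0;
      b->capacity = cap;
      ctx->cur = b;
      ctx->malloc_calls++;
   }

   /* The common case is just a pointer bump. */
   void *p = (unsigned char *)b->data + b->offset;
   b->offset += size;
   return p;
}

/* Individual allocations are never freed; everything goes at once. */
static void
sketch_linear_free_all(struct sketch_linear *ctx)
{
   while (ctx->cur) {
      struct sketch_buf *prev = ctx->cur->prev;
      free(ctx->cur);
      ctx->cur = prev;
   }
}
```

In this sketch, many small allocations share one malloc call per buffer,
which is the kind of reduction in malloc calls the series is after.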
The new allocator is used in all places where short-lived allocations
are used with a high number of malloc calls. The series also contains
other improvements not related to the new allocator that also improve
compile times. The results are below.
I tested my shader-db with shaders only being compiled to TGSI.
(noop gallium driver)
master + libc's malloc:
maxmem 275 MB
master + jemalloc preloaded:
maxmem 284 MB
the series + libc's malloc:
maxmem 270 MB
the series + jemalloc preloaded:
maxmem 284 MB
The series without jemalloc almost caught up with jemalloc + master.
However, jemalloc also benefits.
Current Mesa needs 54.182s and it drops to 40.729s with my series and
jemalloc. The total change in compile time is -25% if we incorporate
both. Without jemalloc, the difference is only -14.7%.
With radeonsi, the improvement is slightly more than half of that
(if you add the LLVM time). However, radeonsi also has asynchronous
shader compilation hiding LLVM overhead in some cases, so it depends.
Drivers with faster compiler backends will benefit more than radeonsi,
but will probably not reach -25% or -14.7% (except softpipe, which uses
The memory usage looks reasonable in all tested cases.
Note: One of the first patches moves memset from ralloc to rzalloc.
I tested and fixed the GLSL source -> TGSI path, but other codepaths
may break, and you need to use valgrind to find all uninitialized
variables that relied on ralloc doing memset (if there are any).
You can also find it here:
src/compiler/glsl/ast.h | 4 +-
src/compiler/glsl/ast_to_hir.cpp | 4 +-
src/compiler/glsl/ast_type.cpp | 13 ++-
src/compiler/glsl/glcpp/glcpp-lex.l | 2 +-
src/compiler/glsl/glcpp/glcpp-parse.y | 203
src/compiler/glsl/glcpp/glcpp.h | 1 +
src/compiler/glsl/glsl_lexer.ll | 16 +--
src/compiler/glsl/glsl_parser.yy | 202
src/compiler/glsl/glsl_parser_extras.cpp | 6 +-
src/compiler/glsl/glsl_parser_extras.h | 4 +-
src/compiler/glsl/glsl_symbol_table.cpp | 19 ++--
src/compiler/glsl/glsl_symbol_table.h | 1 +
src/compiler/glsl/ir.cpp | 4 +
src/compiler/glsl/ir.h | 13 ++-
src/compiler/glsl/link_uniform_blocks.cpp | 2 +-
src/compiler/glsl/list.h | 2 +-
src/compiler/glsl/lower_packed_varyings.cpp | 8 +-
src/compiler/glsl/opt_constant_propagation.cpp | 14 ++-
src/compiler/glsl/opt_copy_propagation.cpp | 7 +-
src/compiler/glsl/opt_copy_propagation_elements.cpp | 19 ++--
src/compiler/glsl/opt_dead_code_local.cpp | 12 ++-
src/compiler/glsl_types.cpp | 38 +------
src/compiler/glsl_types.h | 6 +-
src/compiler/nir/nir.c | 8 +-
src/compiler/spirv/vtn_variables.c | 3 +-
src/gallium/drivers/freedreno/ir3/ir3.c | 2 +-
src/gallium/drivers/vc4/vc4_cl.c | 2 +-
src/gallium/drivers/vc4/vc4_program.c | 2 +-
src/gallium/drivers/vc4/vc4_simulator.c | 5 +-
src/mesa/drivers/dri/i965/brw_state_batch.c | 5 +-
src/util/ralloc.c | 392
src/util/ralloc.h | 93 ++++++++++++++++--
32 files changed, 782 insertions(+), 330 deletions(-)
mesa-dev mailing list