Re: [HACKERS] WIP: Faster Expression Processing and Tuple Deforming (including JIT)

Andres Freund Tue, 06 Dec 2016 15:23:09 -0800

On 2016-12-06 13:27:14 -0800, Peter Geoghegan wrote:
> On Mon, Dec 5, 2016 at 7:49 PM, Andres Freund <[email protected]> wrote:
> > I tried to address 2) by changing the C implementation. That brings some
> > measurable speedups, but it's not huge. A bigger speedup is making
> > slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers;
> > but it's still not huge.  Finally I turned to just-in-time (JIT)
> > compiling the code for tuple deforming. That doesn't save the cost of
> > 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01).  The
> > first part is done in 0008, the JITing in 0012.
>
> A more complete motivating example would be nice. For example, it
> would be nice to see the overall speedup for some particular TPC-H
> query.


Well, it's a bit too WIP-y for that: not all TPCH queries run JITed yet, as
I've not done that for enough expression types... And you quickly run
into other bottlenecks.

But here we go for TPCH (scale 10) Q01:
master:
Time: 33885.381 ms

profile:
  16.29%  postgres  postgres          [.] slot_getattr
  12.85%  postgres  postgres          [.] ExecMakeFunctionResultNoSets
  10.85%  postgres  postgres          [.] advance_aggregates
   6.91%  postgres  postgres          [.] slot_deform_tuple
   6.70%  postgres  postgres          [.] advance_transition_function
   4.59%  postgres  postgres          [.] ExecProject
   4.25%  postgres  postgres          [.] float8_accum
   3.69%  postgres  postgres          [.] tuplehash_insert
   2.39%  postgres  postgres          [.] float8pl
   2.20%  postgres  postgres          [.] bpchareq
   2.03%  postgres  postgres          [.] check_stack_depth

(note that on master, expression evaluation is distributed among many
functions)
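On master, evaluating an expression means recursing through a tree of nodes with one function call per node. That is why its cost is spread over ExecMakeFunctionResultNoSets, ExecProject and friends, whereas in the dev profile it shows up mainly as ExecEvalExpr. The C sketch below illustrates the general technique of flattening an expression into a linear array of steps that a single loop executes with a small value stack. It is a simplified illustration with assumed names (`Step`, `OpCode`, `eval_steps`), not the patch's actual data structures.

```c
#include <assert.h>

/* Hypothetical opcodes for a tiny expression "program". */
typedef enum { OP_LOAD_COL, OP_LOAD_CONST, OP_SUB, OP_MUL, OP_DONE } OpCode;

typedef struct {
    OpCode op;
    int    col;     /* input column, used by OP_LOAD_COL */
    double value;   /* constant, used by OP_LOAD_CONST */
} Step;

#define STACK_SIZE 16

/* Evaluate a flattened expression in one loop over its steps, using a
 * small value stack instead of one recursive call per tree node. */
static double eval_steps(const Step *steps, const double *row)
{
    double stack[STACK_SIZE];
    int sp = 0;

    for (const Step *s = steps;; s++) {
        switch (s->op) {
        case OP_LOAD_COL:
            assert(sp < STACK_SIZE);
            stack[sp++] = row[s->col];
            break;
        case OP_LOAD_CONST:
            assert(sp < STACK_SIZE);
            stack[sp++] = s->value;
            break;
        case OP_SUB:                    /* pop b, pop a, push a - b */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] -= stack[sp];
            break;
        case OP_MUL:                    /* pop b, pop a, push a * b */
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_DONE:
            assert(sp == 1);
            return stack[0];
        }
    }
}
```

For example, Q01's `l_extendedprice * (1 - l_discount)` becomes six steps: load column 0, load constant 1, load column 1, subtract, multiply, done. Each step is one cheap dispatch inside the same function, which is also a convenient shape to hand to a JIT.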

dev (no JITing):
Time: 30343.532 ms

profile:
  16.57%  postgres  postgres          [.] slot_deform_tuple
  13.39%  postgres  postgres          [.] ExecEvalExpr
   8.64%  postgres  postgres          [.] advance_aggregates
   8.58%  postgres  postgres          [.] advance_transition_function
   5.83%  postgres  postgres          [.] float8_accum
   5.14%  postgres  postgres          [.] tuplehash_insert
   3.89%  postgres  postgres          [.] float8pl
   3.60%  postgres  postgres          [.] slot_getattr
   2.66%  postgres  postgres          [.] bpchareq
   2.56%  postgres  postgres          [.] heap_getnext

dev (JITing):
SET jit_tuple_deforming = on;
SET jit_expressions = true;

Time: 24439.803 ms

profile:
  11.11%  postgres  postgres             [.] slot_deform_tuple
  10.87%  postgres  postgres             [.] advance_aggregates
   9.74%  postgres  postgres             [.] advance_transition_function
   6.53%  postgres  postgres             [.] float8_accum
   5.25%  postgres  postgres             [.] tuplehash_insert
   4.31%  postgres  perf-10698.map       [.] deform0
   3.68%  postgres  perf-10698.map       [.] evalexpr6
   3.53%  postgres  postgres             [.] slot_getattr
   3.41%  postgres  postgres             [.] float8pl
   2.84%  postgres  postgres             [.] bpchareq

(note how expression evaluation went from 13.39% to roughly 4%)
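The deform0 entry above is JIT-generated deforming code. To illustrate what specializing for one tuple layout buys, compare a generic deformer with one written for a single known descriptor. The generic version must consult the descriptor and null bitmap for every attribute of every tuple. The specialized version can hard-code the offsets. This C sketch is a simplified model, not PostgreSQL's slot_deform_tuple or the patch's generated code. It assumes fixed-length integer columns only, and the names (`Attr`, `deform_generic`, `deform_int4_int8_int2`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified attribute descriptor: fixed-length types only. */
typedef struct {
    int len;    /* 1, 2, 4 or 8 bytes */
    int align;  /* required alignment in bytes, a power of two */
} Attr;

/* Read a fixed-length integer of 'len' bytes without unaligned access. */
static int64_t fetch_fixed(const char *p, int len)
{
    switch (len) {
    case 1: { int8_t v;  memcpy(&v, p, 1); return v; }
    case 2: { int16_t v; memcpy(&v, p, 2); return v; }
    case 4: { int32_t v; memcpy(&v, p, 4); return v; }
    default: { int64_t v; memcpy(&v, p, 8); return v; }
    }
}

/* Generic (interpreted) deformer: per attribute, test the null bitmap and
 * compute alignment from the descriptor.  As in PostgreSQL heap tuples, a
 * set bit means "not null" and NULLs take no space in the data area. */
static void deform_generic(const Attr *attrs, int natts,
                           const uint8_t *nullbits, const char *data,
                           int64_t *values, bool *isnull)
{
    size_t off = 0;

    for (int i = 0; i < natts; i++) {
        assert(attrs[i].align > 0 && (attrs[i].align & (attrs[i].align - 1)) == 0);
        if (nullbits != NULL && !(nullbits[i >> 3] & (1 << (i & 7)))) {
            values[i] = 0;
            isnull[i] = true;
            continue;
        }
        off = (off + (size_t) attrs[i].align - 1) & ~((size_t) attrs[i].align - 1);
        values[i] = fetch_fixed(data + off, attrs[i].len);
        isnull[i] = false;
        off += (size_t) attrs[i].len;
    }
}

/* What code specialized for (int4 NOT NULL, int8 NOT NULL, int2 NOT NULL)
 * could reduce to: every offset is a constant, so the loop, the alignment
 * arithmetic and the null checks are gone. */
static void deform_int4_int8_int2(const char *data, int64_t *values, bool *isnull)
{
    int32_t a;
    int64_t b;
    int16_t c;

    memcpy(&a, data + 0, sizeof a);
    memcpy(&b, data + 8, sizeof b);    /* int8 padded to offset 8 */
    memcpy(&c, data + 16, sizeof c);
    values[0] = a;
    values[1] = b;
    values[2] = c;
    isnull[0] = isnull[1] = isnull[2] = false;
}
```

Both functions produce the same values for a tuple with that layout. A JIT would emit the second form at runtime for whatever descriptor the query actually uses, instead of it being written by hand.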

The slot_deform_tuple cost here is primarily due to cache misses. If you
do the "memory order" iteration, it drops significantly.

The JIT-generated code still leaves a lot on the table, i.e. this is
definitely not the best we can do.  We also deform half the tuple twice,
because I've not yet added support for starting to deform in the middle
of a tuple.
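Resuming in the middle of a tuple mostly requires remembering how far a previous call got. That means tracking how many attributes are already valid (PostgreSQL slots track this count as tts_nvalid) and the byte offset reached. The following is a minimal C sketch of that bookkeeping. It assumes fixed-length columns whose alignment equals their length, and `Slot` and `deform_upto` are hypothetical names, not PostgreSQL's structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ATTS 8

/* Hypothetical slot-like state; field names are illustrative only. */
typedef struct {
    const char *data;      /* tuple data area */
    const int  *attlen;    /* fixed lengths 1, 2, 4 or 8; alignment == length here */
    int         natts;
    int         nvalid;    /* attributes [0, nvalid) already extracted */
    size_t      off;       /* byte offset just past attribute nvalid - 1 */
    int64_t     values[MAX_ATTS];
} Slot;

/* Read a fixed-length integer of 'len' bytes without unaligned access. */
static int64_t fetch_fixed(const char *p, int len)
{
    switch (len) {
    case 1: { int8_t v;  memcpy(&v, p, 1); return v; }
    case 2: { int16_t v; memcpy(&v, p, 2); return v; }
    case 4: { int32_t v; memcpy(&v, p, 4); return v; }
    default: { int64_t v; memcpy(&v, p, 8); return v; }
    }
}

/* Make attributes [0, upto) available.  Work starts at slot->nvalid and the
 * saved offset, so asking for more columns later never re-reads the prefix
 * that an earlier call already deformed. */
static void deform_upto(Slot *slot, int upto)
{
    assert(upto <= slot->natts && slot->natts <= MAX_ATTS);
    size_t off = slot->off;

    for (int i = slot->nvalid; i < upto; i++) {
        size_t len = (size_t) slot->attlen[i];
        off = (off + len - 1) & ~(len - 1);    /* align to the type length */
        slot->values[i] = fetch_fixed(slot->data + off, slot->attlen[i]);
        off += len;
    }
    if (upto > slot->nvalid) {
        slot->nvalid = upto;
        slot->off = off;
    }
}
```

A first call for two columns stops at offset 16. A later call for the third column starts there instead of at offset 0, which is the "half the tuple twice" cost described above.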

Independent of new expression evaluation and/or JITing, if you make
advance_aggregates and advance_transition_function inline functions (or
profile with accounting for children), you'll notice that ExecAgg() +
advance_aggregates + advance_transition_function themselves take up
about 20% of CPU time.  That's *not* including the hashtable management,
the actual transition functions, and so on.


If you have queries where tuple deforming is a bigger proportion of the
load, or where expression evaluation (including projection) is a larger
part (e.g. any NULLs), you can get much bigger wins, even without
actually optimizing the generated code (which I've not yet done).

Just btw: float8_accum really should use an internal aggregation type
instead of a Postgres array...


Andres


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
