On Wed, Feb 26, 2020 at 7:59 PM Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
> On 26.02.2020 13:11, Hubert Zhang wrote:
>
>> and with JIT:
>>
>>  13.88%  postgres  postgres             [.] tts_buffer_heap_getsomeattrs
>>   7.15%  postgres  vectorize_engine.so  [.] vfloat8_accum
>>   6.03%  postgres  postgres             [.] HeapTupleSatisfiesVisibility
>>   5.55%  postgres  postgres             [.] bpchareq
>>   4.42%  postgres  vectorize_engine.so  [.] VExecStoreColumns
>>   4.19%  postgres  postgres             [.] hashbpchar
>>   4.09%  postgres  vectorize_engine.so  [.] vfloat8pl
>>
>> I also tested Q1 with your latest code. The result of the vectorized version is still slow.
>> PG13 native:  38 secs
>> PG13 Vec:     30 secs
>> PG13 JIT:     23 secs
>> PG13 JIT+Vec: 27 secs
>
> It is strange that your results are much slower than mine and the profile is
> very different. Which postgres configuration are you using?

./configure CFLAGS="-O3 -g -march=native" --prefix=/usr/local/pgsql/ --disable-cassert --enable-debug --with-llvm

I also use `PGXS := $(shell $(PG_CONFIG) --pgxs)` to compile vectorize_engine, so it shares the same compile configuration. My perf result is as below. There are three parts:

>> 1. lookup_hash_entry (43.5%): this part is not vectorized yet.
>
> It is vectorized in some sense: lookup_hash_entry performs the bulk of the
> hash lookups and passes an array with the results of those lookups to the
> aggregate transition functions. It would be possible to significantly
> increase the speed of HashAgg if we stored data in the order of the grouping
> attributes and used RLE (run-length encoding) to perform just one hash
> lookup per group of values. But that requires creating special partitions
> (as is done in Vertica and VOPS).

Yes, Vertica's partitions need to be pre-sorted on user-defined columns, so for TPC-H Q1 on Postgres we cannot rely on that assumption. Also, my Q1 plan uses HashAgg instead of GroupAgg based on cost.

>> 2. scan part: fetch_input_tuple (36%)
>> 3. vadvance_aggregates part (20%)
>>
>> I also perfed the PG96 vectorized version and got similar perf results, and
>> the running times of vectorized PG96 and PG13 are also similar.
>> But PG13 is much faster than PG96. So I just wonder whether we merged all
>> the latest executor code of PG13 into the vectorized PG13 branch?
>
> Sorry, I do not understand the question. vectorize_executor contains patched
> versions of nodeSeqscan and nodeAgg from the standard executor. When porting
> to PG13, I took the latest version of nodeAgg and tried to apply your patches
> to it. It was not always possible, and I had to rewrite a lot of places.
> Concerning nodeSeqscan, I took the old version from vectorize_executor and
> ported it to PG13.
>
> It is strange that I am not seeing lookup_hash_entry in the profile in my
> case.

So you already have the PG13 nodeAgg; that is good. Yes, it is strange. Hash table probing is always the costly part. My perf command is `perf record --call-graph dwarf -p pid`.

Could you share your lineitem schema and Q1 query? My schema and Q1 query are:

CREATE TABLE lineitem (
    l_orderkey      BIGINT NOT NULL,
    l_partkey       INTEGER NOT NULL,
    l_suppkey       INTEGER NOT NULL,
    l_linenumber    INTEGER NOT NULL,
    l_quantity      double precision NOT NULL,
    l_extendedprice double precision NOT NULL,
    l_discount      double precision NOT NULL,
    l_tax           double precision NOT NULL,
    l_returnflag    CHAR(1) NOT NULL,
    l_linestatus    CHAR(1) NOT NULL,
    l_shipdate      DATE NOT NULL,
    l_commitdate    DATE NOT NULL,
    l_receiptdate   DATE NOT NULL,
    l_shipinstruct  CHAR(25) NOT NULL,
    l_shipmode      CHAR(10) NOT NULL,
    l_comment       VARCHAR(44) NOT NULL
);

select
    l_returnflag,
    l_linestatus,
    sum(l_quantity) as sum_qty,
    sum(l_extendedprice) as sum_base_price,
    sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
    sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
    avg(l_quantity) as avg_qty,
    avg(l_extendedprice) as avg_price,
    avg(l_discount) as avg_disc,
    count(l_discount) as count_order
from
    lineitem
where
    l_shipdate <= date '1998-12-01' - interval '106 day'
group by
    l_returnflag,
    l_linestatus;

--
Thanks

Hubert Zhang