HI,
On 2020-02-04 01:48:49 +1300, Thomas Munro wrote:
> On Fri, Apr 12, 2019 at 4:23 PM Thomas Munro <[email protected]> wrote:
> > ... if we also prefetch during
> > the build phase ...
>
> Here's an experimental patch to investigate just that part. I tried
> initiating a prefetch of the bucket just before we copy the tuple and
> then finally insert it, but it doesn't seem to be far enough apart (at
> least for small tuples), which is a shame because that'd be a one line
> patch. So I made a little insert queue that prefetches and defers the
> insertion until N tuples later, and then I managed to get between 10%
> and 20% speed-up for contrived tests like this:
How much of the benefit here comes from the prefetching, and how much
just from writing the code in a manner that allows for more out-of-order
execution? Because there's no dependencies preventing execution of the
next queued tuple anymore, I'd assume that this is a good part what
helps?
Code like this really should look something roughly like:
while (true)
have_skew = False
# get tuples
for i in 0..batchsize:
tuples[i] = ExecProcNode(outerNode);
if tuples[i] == NULL:
# have slowpath handle this
break;
# compute their hash values
for i in 0..batchsize:
hashvalues[i] = ExecHashGetHashValue(tuples[i])
# check whether go into skew buckets
if have_skew_table:
for i in 0..batchsize:
skewbuckets[i] = ExecHashGetSkewBucket(tuples[i], hashvalues[i])
if (skewbuckets[i] != INVALID_SKEW_BUCKET_NO)
have_skew = True
if have_skew:
# handle everything here
continue
# assume there's no skew tuples going forward, all handled above
# compute bucket/batch for all tuples
have_into_batch = False
for i in 0..batchsize:
hashbuckets[i] = ExecHashGetBucketAndBatch()
if hashbuckets[i] != hashtable->curbatch:
have_into_batchfile = True
if have_into_batchfile:
# slowpath
continue
# Allocate all tuples
for i in 0..batchsize:
hjtuples[i] = alloc_mintuple(hashtuples[i])
# And finally isert them
for i in 0..batchsize:
hjtuple.next = buckets[hashbuckets[i]]
buckets[hashbuckets[i]] = hjtuple
Obviously it's a bit more complicated in reality than this, but I do
think that's where we've to go to make crucial parts like this faster
(same with hashaggs, and a few other places).
I would bet this helps significantly even if there's no prefetch
instruction - but using explicit prefetching might help further. Also
allows us to increase branch costs, because we can amortize them over a
few tuples.
> create unlogged table t as select generate_series(1, 100000000)::int i;
> select pg_prewarm('t');
> set work_mem='8GB';
>
> select count(*) from t t1 join t t2 using (i);
>
> master patched/N=1 patched/N=4
> workers=0 89.808s 80.556s (+11%) 76.864 (+16%)
> workers=2 27.010s 22.679s (+19%) 23.503 (+15%)
> workers=4 16.728s 14.896s (+12%) 14.167 (+18%)
>
> Just an early experiment, but I though it looked promising enough to share.
Nice!
Greetings,
Andres Freund