On Fri, Apr 12, 2019 at 1:35 AM Robert Haas <robertmh...@gmail.com> wrote:
> It would be interesting to see how this does with moderately-long text
> keys, say 32 or 64 byte strings, and with actually-long text keys, say
> several kB, and then with gigantic text keys, say several MB.  At some
> point the key probably gets large enough that computing the hash value
> for the next key evicts the current key from the relevant CPU cache,
> and if I had to guess, at that point prefetching will become a loser.
> But I don't know where that point is.  If it turns out for example
> that this technique is a winner for pass-by-value datatypes and a
> loser for pass-by-reference datatypes, or that it's a winner always,
> or some sort of simple rule like that, awesome!  But if it turns out
> that there's no simple rule that we can use to know whether it wins or
> loses, then that might make things a little tricky.

Ok, I ran the attached script on an E5-2695 v3 @ 2.30GHz with 32K of
L1, 256K of L2, 30M of L3.  I used shared_buffers=16GB and prewarmed
all tables.

The short version: very mixed results.  For small hash tables it
clearly hurts, for large ones it looks promising.  Much more digging
required to draw any conclusions.

It'd be nice to understand exactly how the hidden parameters work.
Hash bucket array size vs hardware cache sizes must be a factor.
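A back-of-envelope check of that factor, under an assumption not taken from the patch itself: one 8-byte bucket header pointer per bucket, with nbuckets rounded up to the next power of two at or above the number of build-side tuples.  Against the E5-2695 v3 cache sizes above, that puts the three s sizes in quite different regimes:

```python
# Rough bucket-array-vs-cache arithmetic for the three s sizes tested.
# Assumption (illustrative, not from the patch): one 8-byte pointer per
# bucket, nbuckets = next power of two >= ntuples.
def bucket_array_bytes(ntuples, pointer_size=8):
    nbuckets = 1
    while nbuckets < ntuples:
        nbuckets *= 2
    return nbuckets * pointer_size

caches = {"L1": 32 * 1024, "L2": 256 * 1024, "L3": 30 * 1024 * 1024}
for s_size in (1000, 100000, 10000000):
    size = bucket_array_bytes(s_size)
    fits = [name for name, c in sorted(caches.items(), key=lambda kv: kv[1])
            if size <= c]
    print("s=%d: bucket array ~%dKB, fits in: %s" %
          (s_size, size // 1024, ", ".join(fits) or "no cache"))
```

On those assumptions the s=1,000 bucket array (~8KB) fits in L1, s=100,000 (~1MB) only in L3, and s=10,000,000 (~128MB, before even counting the tuples) fits in no cache at all, which lines up with prefetching only paying off in the right-hand column below.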
Another is surely timing; there is an optimal time to begin
prefetching (too soon and your line might be replaced by the time you
need it, too late and you won't avoid stalling).  Here, I am doing a
ham-fisted prefetch of t+1's bucket header and not yet trying to
prefetch the tuple itself, but the Intel/CMU paper[1] of course shows
a deeper pipeline and talks about calibrating the prefetch distance
for the hardware.  As far as I can see, that's quite tricky in
general, and complicated by our executor design: the number of cycles
before the node runs again depends on the higher-level plan!
Previously I have heard icache-based arguments for why we should be
able to emit more than one tuple at a time, and I suppose this is a
new argument for that: it makes it actually plausible to calibrate the
prefetch distance for "software-pipelined prefetch" techniques.
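To make the loop shape concrete, here is an illustrative sketch in Python (a stand-in for the C probe loop, not the patch itself): while probing tuple t, issue a prefetch for the bucket that tuple t+distance will need.  prefetch() is a no-op stub here; in C it would be __builtin_prefetch() on the bucket header, and picking "distance" is exactly the calibration problem described above.

```python
def prefetch(buckets, bucket_no):
    pass  # stands in for __builtin_prefetch(&buckets[bucket_no]) in C

def build(keys, nbuckets):
    # Build side: group keys into buckets by hash value.
    buckets = {}
    for key in keys:
        buckets.setdefault(hash(key) % nbuckets, []).append(key)
    return buckets

def probe_all(keys, buckets, nbuckets, distance=1):
    # Probe side, software-pipelined: touch the bucket for key t+distance
    # before doing the real probe for key t.
    matches = 0
    for t, key in enumerate(keys):
        if t + distance < len(keys):
            prefetch(buckets, hash(keys[t + distance]) % nbuckets)
        matches += sum(1 for k in buckets.get(hash(key) % nbuckets, ())
                       if k == key)
    return matches
```

The deeper pipeline in the paper amounts to doing the same trick at more than one stage (bucket header, then tuple), with a per-stage distance.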

Anyway, here are the results for what they are worth:

Table r has 50,000,000 rows.  Table s was tested with 3 different
sizes.  Both tables have the same layout: key (various types and
sizes) + 0, 2 or 4 extra columns.  I ran each test 3 times, and
compared the worst (w) and median (m) times and computed the speed-up
provided by the patch.  The absolute times (not shown) were all in the
range 9-40 seconds depending on the parameters.

            s=1,000              s=100,000            s=10,000,000
            ==================== ==================== ====================
int         w=-10.86%, m=-11.03% w=-8.56%, m=-9.17%   w=+16.79%, m=+19.63%
 + 2 cols   w=-14.19%, m=-16.97% w=-6.59%, m=-7.89%   w=+15.88%, m=+16.81%
 + 4 cols   w=-17.14%, m=-14.34% w=-10.01%, m=-8.85%  w=+37.38%, m=+24.01%
text(8)     w=-12.42%, m=-12.32% w=-4.04%, m=-1.96%   w=+13.52%, m=+16.17%
 + 2 cols   w=-13.50%, m=-14.98% w=-4.29%, m=-3.40%   w=+15.98%, m=+19.45%
 + 4 cols   w=-11.53%, m=-9.61%  w=-3.70%, m=-5.91%   w=+46.66%, m=+51.06%
text(16)    w=-11.46%, m=-9.71%  w=-4.85%, m=-3.86%   w=+16.47%, m=+22.10%
 + 2 cols   w=-13.97%, m=-14.55% w=-7.08%, m=-6.07%   w=+20.50%, m=+21.77%
 + 4 cols   w=-9.72%, m=-11.31%  w=-1.03%, m=-2.55%   w=+8.25%, m=+12.21%
text(32)    w=-14.86%, m=-15.48% w=-9.36%, m=-9.27%   w=+19.86%, m=+15.34%
 + 2 cols   w=-12.35%, m=-11.71% w=-10.61%, m=-9.87%  w=+98.29%, m=+97.40%
 + 4 cols   w=-10.71%, m=-10.40% w=-2.42%, m=-1.25%   w=-8.34%, m=-10.44%
text(64)    w=-9.45%, m=-11.36%  w=-13.94%, m=-11.42% w=+9.44%, m=+9.57%
 + 2 cols   w=-12.47%, m=-13.17% w=-9.60%, m=-6.61%   w=-4.69%, m=+10.06%
 + 4 cols   w=-9.47%, m=-12.20%  w=-5.60%, m=-3.55%   w=-15.91%, m=-23.29%

I'd expect the right-hand column to get a further speed-up (or
slow-down) of around 1/5 the given numbers, if we also prefetch during
the build phase (= s/r).  Maybe a 2-stage pipeline would help, though
I'm starting to see the complexity of organising a perfectly primed
memory pipeline ... ie what this topic is really all about.
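For clarity, the "around 1/5" is just the ratio of build-side to probe-side work in the right-hand column:

```python
# In the s=10,000,000 column the build phase inserts s tuples while the
# probe phase looks up r tuples, so build is roughly s/r of the traffic.
r_size, s_size = 50000000, 10000000
build_fraction = s_size / r_size
print(build_fraction)  # 0.2
```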

Well, that's enough learning-basic-facts-about-computers by trying to
whack PostgreSQL with database literature for now.  Looks like I'll
have to work quite a bit harder to make something useful out of this.
I think I see where some of the problems lie.  I think being able to
store multiple tuples in a slot (via TTS-implementation-specific
means, as we see with the heap scan in my patch, and as I think hash
join would want to do to emit multiple tuples before relinquishing
control) and then look at them via ExecScrollSlot() and perhaps also
consume them via ExecNextSlot() are promising ideas, but I've only
scratched the surface.
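A toy model of that slot idea, purely illustrative (ExecScrollSlot()/ExecNextSlot() are the proposed C names; the Python class and its methods are my own stand-ins): a slot buffers a batch of tuples, scroll() peeks at the n-th buffered tuple without consuming it (e.g. so a hash join could compute hashes and issue prefetches ahead of time), and next() consumes them in order.

```python
class MultiTupleSlot:
    """Sketch of a slot that can hold several tuples at once."""

    def __init__(self):
        self.tuples = []
        self.pos = 0

    def store_batch(self, tuples):
        # Filled by TTS-implementation-specific means (cf. the heap scan
        # in the patch); here just a plain list.
        self.tuples = list(tuples)
        self.pos = 0

    def scroll(self, n):
        # Peek at the n-th buffered tuple without consuming it.
        return self.tuples[n] if n < len(self.tuples) else None

    def next(self):
        # Consume the next buffered tuple; None when the batch is drained.
        if self.pos >= len(self.tuples):
            return None
        t = self.tuples[self.pos]
        self.pos += 1
        return t
```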

[1] http://www.cs.cmu.edu/~chensm/papers/hashjoin_tods_preliminary.pdf


--
Thomas Munro
https://enterprisedb.com
# Benchmark generator: prints a psql script (pipe the output into psql).
r_size = 50000000

print("\\timing off")

print("set work_mem = '2GB';")
print("set max_parallel_workers_per_gather = 0;")

for hit_ratio in (1, 2, 4):
  for key_size in (0, 4, 8, 16, 32, 64):
    if key_size == 0:
      key_type = "int"
    else:
      key_type = "text"
    for extra_cols in range(4):
      for s_size in (1000, 10000, 100000, 1000000):
        print("drop table if exists r;")
        print("drop table if exists s;")
        print("create table r (k %s%s);" % (key_type, "".join([", c%d text" % i for i in range(extra_cols)])))
        print("create table s (k %s%s);" % (key_type, "".join([", c%d text" % i for i in range(extra_cols)])))
        if key_type == "int":
          print("insert into r select generate_series(0, %d) %% %d%s;" % (r_size, s_size * hit_ratio, ", 'hello world'" * extra_cols))
          print("insert into s select generate_series(0, %d)%s;" % (s_size, ", 'hello world'" * extra_cols))
        else:
          print("insert into r select to_char(generate_series(0, %d) %% %d, '%s')%s;" % (r_size, s_size * hit_ratio, "0" + "9" * key_size, ", 'hello world'" * extra_cols))
          print("insert into s select to_char(generate_series(0, %d), '%s')%s;" % (s_size, "0" + "9" * key_size, ", 'hello world'" * extra_cols))
        print("analyze;")
        print("select pg_prewarm('r');")
        print("select pg_prewarm('s');")

        print("\\timing on")
        print("\\echo ### BEGIN r_size = %s, s_size = %s, hit_ratio = %s, key_size = %s, extra_cols = %s" % (r_size, s_size, hit_ratio, key_size, extra_cols))
        print("select count(*) from r join s using (k);")
        print("select count(*) from r join s using (k);")
        print("select count(*) from r join s using (k);")
        print("\\echo ### END")
        print("\\timing off")
# Results parser: reads the psql output from the unpatched ("master") and
# patched runs and prints the speed-up table above.
import re

begin_pattern = re.compile("^### BEGIN r_size = ([0-9]+), s_size = ([0-9]+), hit_ratio = 1, key_size = ([0-9]+), extra_cols = ([0-9]+)")
time_pattern = re.compile("^Time: ([0-9.]+) ms ")

def read_file(path):
  result = {}
  key = None  # set by the most recent ### BEGIN line
  with open(path) as f:
    for line in f:
      groups = re.match(begin_pattern, line)
      if groups:
        key = groups.group(1, 2, 3, 4)
      groups = re.match(time_pattern, line)
      if groups and key is not None:
        time = float(groups.group(1))
        result.setdefault(key, []).append(time)
  return result

master = read_file("test.master.out")
patched = read_file("test.patched.out")

r_size = 50000000
s_sizes = (1000, 100000, 10000000)
key_sizes = (0, 8, 16, 32, 64)
extra_colss = (0, 2, 4)

margin_width = 12
column_width = 21

print("".ljust(margin_width) + "".join(["s={:,}".format(s_size).ljust(column_width) for s_size in s_sizes]))
print("".ljust(margin_width) + "".join(["=" * (column_width - 1) + " " for s_size in s_sizes]))

for key_size in key_sizes:
  for extra_cols in extra_colss:
    if extra_cols == 0:
      if key_size == 0:
        line = "int"
      else:
        line = "text(%d)" % key_size
    else:
      line = " + %d cols" % extra_cols
    line = line.ljust(margin_width)
    for s_size in s_sizes:
      key = (str(r_size), str(s_size), str(key_size), str(extra_cols))
      if key in master and key in patched:
        master_times = sorted(master[key])
        patched_times = sorted(patched[key])
        master_worst_time = master_times[-1]
        patched_worst_time = patched_times[-1]
        master_median_time = master_times[1]  # median of the three runs
        patched_median_time = patched_times[1]
        cell = "w=%+.2f%%, m=%+.2f%%" % (((master_worst_time / patched_worst_time) - 1.0) * 100, ((master_median_time / patched_median_time) - 1.0) * 100)
      else:
        cell = ""
      line += cell.ljust(column_width)
    print(line)
