On Sun, Apr 26, 2020 at 02:46:19PM -0400, James Coleman wrote:
>On Sat, Apr 25, 2020 at 8:31 PM Tomas Vondra
><tomas.von...@2ndquadrant.com> wrote:
>>On Sat, Apr 25, 2020 at 06:47:41PM -0400, James Coleman wrote:
>>>On Sat, Apr 25, 2020 at 5:41 PM David Rowley <dgrowle...@gmail.com> wrote:
>>>>
>>>> On Sun, 26 Apr 2020 at 00:40, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>>>> > This reminds me of our attempts to add bloom filters to hash joins,
>>>> > which I think ran into mostly the same challenge of deciding when
>>>> > the bloom filter can be useful and is worth the extra work.
>>>>
>>>> Speaking of that, it would be interesting to see how a test where you
>>>> write the query as IN(VALUES(...)) instead of IN() compares. It would
>>>> be interesting to know if the planner is able to make a more suitable
>>>> choice and also to see how all the work over the years to improve Hash
>>>> Joins compares to the bsearch with and without the bloom filter.
>>>
>>>It would be interesting.
>>>
>>>It also makes one wonder about optimizing these into hash
>>>joins...which I'd thought about over at [1]. I think it'd be a very
>>>significant effort, though.
>>>
>>I modified the script to also do the join version of the query. I can
>>only run it on my laptop at the moment, so the results may be a bit
>>different from those I shared before, but it's interesting, I think.
>>
>>In most cases it's comparable to the binsearch/bloom approach, and in
>>some cases it actually beats them quite significantly. It seems to
>>depend on how expensive the comparison is - for "int" the comparison is
>>very cheap and there's almost no difference. For "text" the comparison
>>is much more expensive, and there are significant speedups.
>>
>>For example, in the test with 100k lookups against an array of 10k
>>elements and a 10% match probability, the timings are these:
>>
>>    master:    62362 ms
>>    binsearch:   201 ms
>>    bloom:        65 ms
>>    hashjoin:     36 ms
>>
>>I do think the explanation is fairly simple - the bloom filter
>>eliminates about 90% of the expensive comparisons, so it's down to
>>roughly 20ms (10% of binsearch's 201ms) plus some overhead to build
>>and check the bits. The hash join probably eliminates a lot of the
>>remaining comparisons too, because the hash table is sized to have one
>>tuple per bucket.
>>
>>Note: I also don't claim the PoC has the most efficient bloom filter
>>implementation possible. I'm sure it could be made faster.
>>
>>Anyway, I'm not sure transforming this to a hash join is worth the
>>effort - I agree that seems quite complex. But perhaps this suggests we
>>should not be doing a binary search and should instead just build a
>>simple hash table - that seems much simpler, and it'll probably give us
>>about the same benefits.
>
>That's actually what I originally thought about doing, but I chose
>binary search since it seemed a lot easier to get off the ground.

OK, that makes perfect sense.

>If we instead build a hash table, is there anything else we need to be
>concerned about? For example, work_mem? I suppose for the binary
>search we already have to expand the array, so perhaps it's not all
>that meaningful relative to that...

I don't think we need to be particularly concerned about work_mem. We
don't care about it now, and it's not clear to me what we could do about
it - we already have the array in memory anyway, so it's a bit futile.
Furthermore, if we need to care about it, it probably applies to the
binary search too.
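
Just to put numbers on the "we already have the array in memory" point,
here's a back-of-the-envelope comparison. The sizes below are
hypothetical round numbers for the int case, not measurements:

#include <stdio.h>

/* round up to the next power of two, i.e. roughly one entry per bucket */
static size_t
next_pow2(size_t n)
{
	size_t		p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

int
main(void)
{
	size_t		nelems = 10000;		/* array size from the benchmark */
	size_t		datum_size = 8;		/* a Datum on 64-bit */
	size_t		entry_size = 16;	/* made-up entry: key + status, padded */

	printf("sorted array: %zu kB\n", nelems * datum_size / 1024);
	printf("hash table:   %zu kB\n", next_pow2(nelems) * entry_size / 1024);
	return 0;
}

So the table might be 2-4x the array, but both are a few hundred kB at
most - nothing work_mem needs to police.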

>I was looking earlier at what our standard hash implementation was,
>and it seemed less obvious what was needed to set that up (so binary
>search seemed a faster proof of concept). If you happen to have any
>pointers to similar usages I should look at, please let me know.

I think the hash join implementation is far too complicated. It has to
care about work_mem, so it implements batching, etc. That's a lot of
complexity we don't need here. IMO we could use either the usual
dynahash, or maybe even the simpler simplehash.
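
To make the simplehash option concrete, instantiating it could look
roughly like this, modeled on the existing usage in tidbitmap.c. It's
only a sketch - the saophash_* / SAOPHashEntry names are invented, it
hardcodes int32 keys, and the real thing would have to hash and compare
through the datatype's support functions:

#include "postgres.h"

#include "common/hashfn.h"	/* murmurhash32 (utils/hashutils.h on older branches) */

/* simplehash requires the element type to carry a "status" member */
typedef struct SAOPHashEntry
{
	int32		key;		/* the array element */
	char		status;		/* empty / in-use, managed by simplehash */
} SAOPHashEntry;

#define SH_PREFIX saophash
#define SH_ELEMENT_TYPE SAOPHashEntry
#define SH_KEY_TYPE int32
#define SH_KEY key
#define SH_HASH_KEY(tb, k) murmurhash32((uint32) (k))
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_SCOPE static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"

/* build the table once, the first time the ScalarArrayOpExpr is evaluated */
static saophash_hash *
saop_build_hash(int32 *elems, int nelems)
{
	saophash_hash *tab = saophash_create(CurrentMemoryContext, nelems, NULL);

	for (int i = 0; i < nelems; i++)
	{
		bool		found;

		saophash_insert(tab, elems[i], &found);
	}

	return tab;
}

/* each probe then replaces the whole binary search */
static bool
saop_lookup(saophash_hash *tab, int32 value)
{
	return saophash_lookup(tab, value) != NULL;
}

The nice thing about simplehash over dynahash here is that the hash and
equality calls get inlined instead of going through function pointers,
which should matter at these probe counts.
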
FWIW it'd be good to verify the numbers I shared, i.e. to check that the
benchmark makes sense and to run it independently. I'm not aware of any
issues, but it was done late at night and only ran on my laptop.
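
For anyone poking at those numbers, the bloom fast-path being measured
is conceptually just the following - a standalone toy with invented
names and plain FNV-1a probes, not code from the PoC patch:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOOM_NBITS	(1 << 16)	/* 64K bits, power of two */
#define BLOOM_MASK	(BLOOM_NBITS - 1)

typedef struct BloomFilter
{
	uint8_t		bits[BLOOM_NBITS / 8];
} BloomFilter;

/* seeded FNV-1a, a stand-in for whatever hash the filter really uses */
static uint32_t
bloom_hash(const char *key, size_t len, uint32_t seed)
{
	uint32_t	h = 2166136261u ^ seed;

	for (size_t i = 0; i < len; i++)
	{
		h ^= (uint8_t) key[i];
		h *= 16777619u;
	}
	return h;
}

/* set two bits per element while building the filter from the array */
static void
bloom_add(BloomFilter *bf, const char *key, size_t len)
{
	uint32_t	h1 = bloom_hash(key, len, 0) & BLOOM_MASK;
	uint32_t	h2 = bloom_hash(key, len, 0x9e3779b9) & BLOOM_MASK;

	bf->bits[h1 / 8] |= (uint8_t) (1 << (h1 % 8));
	bf->bits[h2 / 8] |= (uint8_t) (1 << (h2 % 8));
}

/* "false" means definitely absent, so the binary search can be skipped */
static bool
bloom_maybe_contains(const BloomFilter *bf, const char *key, size_t len)
{
	uint32_t	h1 = bloom_hash(key, len, 0) & BLOOM_MASK;
	uint32_t	h2 = bloom_hash(key, len, 0x9e3779b9) & BLOOM_MASK;

	return (bf->bits[h1 / 8] & (1 << (h1 % 8))) &&
		   (bf->bits[h2 / 8] & (1 << (h2 % 8)));
}

With a 10% match rate, ~90% of lookups take the cheap "definitely
absent" exit (modulo false positives) and never reach the expensive
text comparisons, which is where the 201ms -> 65ms drop comes from.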

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services