On Thu, Sep 7, 2017 at 11:26 AM, Peter Geoghegan <p...@bowt.ie> wrote:
> On Wed, Aug 30, 2017 at 9:29 AM, Peter Geoghegan <p...@bowt.ie> wrote:
>> On Wed, Aug 30, 2017 at 5:02 AM, Alvaro Herrera
>> <alvhe...@2ndquadrant.com> wrote:
>>> Eh, if you want to optimize it for the case where debug output is not
>>> enabled, make sure to use ereport() not elog().  ereport()
>>> short-circuits evaluation of arguments, whereas elog() does not.
>>
>> I should do that, but it's still not really noticeable.
>
> Since this patch has now bit-rotted, I attach a new revision, V2.
> Apart from fixing some Makefile bitrot, this revision also makes some
> small tweaks as suggested by Thomas and Alvaro. The documentation is
> also revised and expanded, and now discusses practical aspects of the
> set membership being tested using a Bloom filter, how that relates to
> maintenance_work_mem, and so on.
>
> Note that this revision does not let the Bloom filter caller use their
> own dynamic shared memory, which is something that Thomas asked about.
> While that could easily be added, I think it should happen later. I
> really just wanted to make sure that my Bloom filter was not in some
> way fundamentally incompatible with Thomas' planned enhancements to
> (parallel) hash join.

I have signed up as a reviewer of this patch, and I have looked at the
bloom filter implementation so far. This is the kind of facility that
people have asked for on this list for many years.

The first thing that strikes me is that there are no tests for this
implementation. Since it would be a building block for other things,
it would be nice to validate that it works properly before moving on
to 0002; 0001 is a feature in its own right. I don't think it would be
complicated to add a small module in src/test/modules which plugs a
couple of SQL functions on top of bloomfilter.h.

+#define MAX_HASH_FUNCS 10
It would be nicer to let the caller define the number of hash
functions at initialization time. This number is usually kept far
lower than the number of elements to check as part of a set, but I see
no reason not to let people use this API in a more flexible way; each
module can then decide what it wants to use.

+ * work_mem is sized in KB, in line with the general convention.
What general convention is that? Using bytes would be more intuitive,
IMO. Still, I think this parameter could be removed entirely; see the
points below.

+/*
+ * Hash function is taken from sdbm, a public-domain reimplementation of the
+ * ndbm database library.
+ */
Reference link?
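For the record, the sdbm hash being referred to is the well-known
public-domain one (often cited alongside djb2). In C it looks like
this (my transcription, not the patch's exact code):

```c
#include <stddef.h>

/*
 * The classic sdbm string hash: for each input byte c,
 * hash = c + (hash << 6) + (hash << 16) - hash,
 * which is equivalent to hash * 65599 + c, starting from 0.
 */
static unsigned long
sdbm_hash(const unsigned char *data, size_t len)
{
	unsigned long hash = 0;

	for (size_t i = 0; i < len; i++)
		hash = data[i] + (hash << 6) + (hash << 16) - hash;
	return hash;
}
```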

+   bitset_bytes = bitset_bits / BITS_PER_BYTE;
+   filter->bitset = palloc0(bitset_bytes);
+   filter->k_hash_funcs = optimal_k(bitset_bits, total_elems);
+   filter->seed = seed;
I think that doing the allocation within the initialization phase is a
mistake. Being able to use a DSA would be nice, but I also predict
cases where a module may want a bloom filter that persists across
multiple transactions, so the allocation should be able to live across
memory contexts. To make this bloom implementation more modular, I
think you should instead let the caller pass a pointer to a memory
area along with its size. What bloom_init should do is just initialize
that area of memory with zeros. This approach would give a lot of
freedom. Not tying the bloom filter definition to work_mem would be
nice as well.

+   hashb = (filter->k_hash_funcs > 1 ? sdbmhash(elem, len) : 0);
I am wondering whether it would make sense to let callers enforce
which hash function is used. The default does not look bad to me,
though, so we could live with it.

+typedef struct bloom_filter bloom_filter;
Not allowing callers to look at the structure contents is
definitely the right approach.

So, in my opinion, this bloom facility still needs more work.
-- 
Michael

