Hi,

On 01/10/2016 04:03 AM, Peter Geoghegan wrote:
On Sat, Jan 9, 2016 at 4:08 PM, Peter Geoghegan <p...@heroku.com> wrote:
Also, have you considered Hash join conditions with multiple
attributes as a special case? I'm thinking of cases like this:

Sorry, accidentally fat-fingered my enter key before I was finished
drafting that mail. That example isn't useful, because there is no
early/cheap elimination of tuples while scanning the outer relation.

My point about bloom filters on only one attribute (or perhaps
multiple bloom filters on multiple attributes) is that you might be
able to make the bloom filter more effective, particularly relative
to the amount of memory available, by only building it for one
attribute where that distinction (one attribute vs multiple) happens
to exist.

I'm not sure what you mean. The patch builds the bloom filter on the hash value (hashvalue), not on the attribute values directly. I find this very efficient, as we don't have to hash the original values again and instead work with just a short 32-bit hash value.

I really doubt that doing some magic by building the bloom filter on only one of the attributes is useful - it's way more complicated, requires estimating ndistinct for each attribute, and also somehow estimating the impact on the accuracy of the bloom filter. That seems rather complex, and I don't have a good idea how to do it.
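
To illustrate what "building the filter on the hash value" means, here is a standalone sketch - not the actual patch code, and all the names (BloomFilter, bloom_bit, bloom_add, bloom_check) are made up for the example. The k bit positions are derived from the single 32-bit hashvalue by cheap remixing (double hashing), so the original attribute values never have to be hashed again:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct BloomFilter
    {
        int         nbits;      /* size of the bit array */
        int         nhashes;    /* number of hash functions (k) */
        uint8_t    *data;       /* the bit array itself */
    } BloomFilter;

    /*
     * Derive the i-th bit position from the single 32-bit hash value by
     * remixing it (h1 + i * h2), instead of hashing the attributes again.
     */
    static uint32_t
    bloom_bit(const BloomFilter *filter, uint32_t hashvalue, int i)
    {
        uint32_t    h1 = hashvalue;
        uint32_t    h2 = hashvalue * 0x9e3779b9 + 1;    /* cheap remix */

        return (h1 + (uint32_t) i * h2) % (uint32_t) filter->nbits;
    }

    /* called while building the hash table from the inner relation */
    static void
    bloom_add(BloomFilter *filter, uint32_t hashvalue)
    {
        for (int i = 0; i < filter->nhashes; i++)
        {
            uint32_t    bit = bloom_bit(filter, hashvalue, i);

            filter->data[bit / 8] |= (1 << (bit % 8));
        }
    }

    /* called for each outer tuple, before probing the hash table */
    static bool
    bloom_check(const BloomFilter *filter, uint32_t hashvalue)
    {
        for (int i = 0; i < filter->nhashes; i++)
        {
            uint32_t    bit = bloom_bit(filter, hashvalue, i);

            if ((filter->data[bit / 8] & (1 << (bit % 8))) == 0)
                return false;   /* definitely not in the hash table */
        }
        return true;            /* may be in the hash table */
    }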


Simon said: "So the objective must be to get a Bloom Filter that is
small enough that it lives in a higher/faster level of cache than the
main Hash table". I agree that that's really important here, although
I did note that Tomas wasn't so sure, emphasizing the importance of
avoiding serialization and deserialization overhead -- and certainly,
that's why the patch helps some cases a lot.

I think the impact of the bloom filter size really depends on the size of the hash join. I see three cases, as indicated in the benchmarks I posted:

(a) single batch (1M:10M)

    - fitting into CPU cache really important (otherwise almost no
      improvement)

    - impossible to speed up for very small hash tables (that fit into
      L3 cache)

(b) multiple batches w. medium data set (10M:100M)

    - important, but not as much as for (a), as we still eliminate the
      (de)serialization overhead

(c) large data set (5M:250M)

    - not important, it's mostly about eliminating (de)serialization
      overhead and/or I/O overhead

> But Simon's point applies more to the worst case than the best or
> average cases, and evidently we have plenty of worrying about the
> worst case left to do here.

So what do you consider to be the worst case?

So, one attribute could have relatively low cardinality, and thus
fewer distinct items in the bloom filter's set, making it far slower
to degrade (i.e. have increased probability of false positives) as
compared to a composite of two or more attributes, and yet perhaps
not significantly less useful in terms of its ability to eliminate
(de)serialization overhead. Or, with two bloom filters (one per
attribute), one bloom filter could degrade far faster than the other
(due to having more distinct values), often very unpredictably, and
yet it wouldn't matter because we'd discard the bad one (maybe
really early, during the scan of the inner relation, using HLL).

I don't think so.

Firstly, I don't know what you mean by "slower to degrade" - the false positive rate of a bloom filter does not depend on the cardinality of the column used to build it, but on the "load factor" of the filter (i.e. the number of items added to the filter compared to the capacity it was sized for). Not to mention that you still have to estimate the cardinality of the columns, which is tricky (and one of the main issues of the current patch, IMHO).
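
For reference, the usual approximation for the false positive rate of a filter with m bits and k hash functions, after n distinct values have been added, is

    p ~= (1 - exp(-k * n / m))^k

which depends only on the load factor n/m and on k - the cardinality of the column matters only in that it determines n, i.e. how many distinct hash values actually get added relative to the capacity the filter was sized for.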

Secondly, by splitting the "composite" bloom filter into per-column filters you make it impossible to track correlation between the columns.

For example, let's say we have two attributes (a,b), with 'a' having 1000 distinct values while 'b' only has 10, and let's assume there's a correlation between them such that (mod(a,10) = b). If you build the bloom filters on the columns separately, they're going to be entirely useless - the per-column filters will accept pretty much any (a,b) combination from those two domains (10,000 of them), while a filter on the combined hash only contains the ~1,000 combinations that actually exist in the inner relation. Sure, this is a rather trivial (and perhaps naive) example, but it shows how easy it is to break.

I'd also like to point out that the vast majority of joins are on a single column, so perhaps we should try to solve that case first, and then improve the multi-column case if possible.


I notice that all these example queries involve less-than operators
(inner relation tuples are thereby prevented from being entered into
the hash table). It might not be a terrible idea to hash abbreviated
keys (or even truncated abbreviated keys) for the bloom filter in
certain cases, in particular when there is a correlation in the inner
relation attributes (equi-joined attribute, and attribute that is
other part of qual). There might not be a correlation, of course, but
with some care hashing abbreviated keys may have virtually no
additional overhead, making it worth doing despite the general
uncertainty about any correlation existing. If you're going to accept
that the bloom filter can't be very large, which I think you must,
then hashing an entire value may be no better than hashing an
abbreviated key (those 8 bytes tend to have a lot of entropy). This
point is rather hand-wavy, though.

I have to admit I have no idea how to do that. Also, there's nothing really special about the '<' operator - the reason I used it is that it makes it very simple to filter an arbitrary fraction of the table (i.e. to specify how many tuples of the outer relation have a matching tuple in the hash table).

I'm not saying we can't apply the abbreviated keys somehow, but at this point it seems rather like a hammer looking for a nail.


I suspect that BLOOM_ERROR_RATE isn't the most useful constraint
here. Is the degree to which it helps in sympathetic cases at all
commensurate with the degree to which it hurts in unsympathetic
cases? In your "low selectivity" cases (the sympathetic cases), there
are relatively few values represented in the main hash table, and
therefore in the bloom filter, and so a smaller bloom filter results
automatically anyway. But in the bad cases, the BLOOM_ERROR_RATE
constraint just wastes memory bandwidth for no gain, I think. If the
number of elements is so large that a reasonable fixed sized bloom
filter has a poor false positive rate anyway, we may have already
lost.

I agree that having a fixed error rate is somewhat annoying. What I imagined would be something like this:

   (1) estimate the number of distinct values for the bloom filter

   (2) see what error rate would make the bloom filter "small enough"
       (so that it fits into L3)

   (3) see if the error rate makes the bloom filter efficient (some
       simple costing, or perhaps hard-coded threshold)

This, however, requires knowledge of the L3 cache size for (2) - or rather, of how much of it is available to the process (which is tricky to get). And then (3) requires some costing function, which seems tricky too.
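
Roughly, the sizing logic might look something like this - just a sketch under those assumptions, with the L3 budget hard-coded and all the names (ASSUMED_L3_BYTES, MAX_USEFUL_ERROR, bloom_filter_worthwhile) made up, not part of the patch:

    #include <math.h>
    #include <stdbool.h>

    /* purely illustrative constants, not from the patch */
    #define ASSUMED_L3_BYTES    (4 * 1024 * 1024)   /* (2) cache budget we assume */
    #define MAX_USEFUL_ERROR    0.05                /* (3) hard-coded threshold */

    /*
     * Given the estimated number of distinct hash values (1), check whether
     * a filter fitting into the cache budget (2) still has a useful false
     * positive rate (3), and if so compute its size.
     */
    static bool
    bloom_filter_worthwhile(double ndistinct, int *nbits, int *nhashes)
    {
        /* largest filter we're willing to build */
        double      maxbits = (double) ASSUMED_L3_BYTES * 8;

        /*
         * False positive rate of a maxbits-sized filter holding ndistinct
         * values, assuming the optimal k = (m/n) * ln(2) hash functions:
         * p = exp(-(m/n) * ln(2)^2).
         */
        double      error = exp(-(maxbits / ndistinct) * pow(log(2.0), 2));

        if (error > MAX_USEFUL_ERROR)
            return false;       /* too many false positives, don't bother */

        /*
         * Size the filter for the threshold error rate (same formula the
         * patch uses in BloomFilterInit), capped by the cache budget.
         */
        *nbits = (int) fmin(maxbits,
                            ceil((ndistinct * log(MAX_USEFUL_ERROR)) /
                                 log(1.0 / pow(2.0, log(2.0)))));
        *nhashes = (int) fmax(1.0, round(log(2.0) * (*nbits) / ndistinct));

        return true;
    }

The hard part is of course what to put in for ASSUMED_L3_BYTES and MAX_USEFUL_ERROR - which is exactly the cache-size and costing problem mentioned above.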


My sense is that we should try to make the bloom filter as cheap as
possible more so than trying to make sure it's useful before much
work has been done.

Well, yeah. Fail fast. But how?

Is it okay that you don't treat MCV/skew values as special? Actually,
looking at this code, I think I notice something:

@@ -238,6 +238,15 @@ ExecHashJoin(HashJoinState *node)
                                                                  hashvalue);
                 node->hj_CurTuple = NULL;

+               /* If still in the first batch, we check the bloom filter. */
+               if ((hashtable->curbatch == 0) &&
+                   (! ExecHashBloomCheckValue(hashtable, hashvalue)))
+               {
+                       /* no matches; check for possible outer-join fill */
+                       node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
+                       continue;
+               }

Haven't we already established by now that the value of
"node->hj_CurSkewBucketNo" may not be INVALID_SKEW_BUCKET_NO, and so
the value certainly was found in the inner relation scan? Why bother
doing anything with the bloom filter in that common case? (Note that
I'm not suggesting that we don't establish "node->hj_CurSkewBucketNo"
first).

Hmmmm, maybe. The bloom filter should contain all values, including those from skew buckets. So what the code could / should do is check the bloom filter first, and only do ExecHashGetSkewBucket() if we still think the tuple may be in the hash table. So something like this:


    node->hj_CurHashValue = hashvalue;
    node->hj_CurTuple = NULL;

    /* If still in the first batch, we check the bloom filter. */
    if ((hashtable->curbatch == 0) &&
        (! ExecHashBloomCheckValue(hashtable, hashvalue)))
    {
        /* no matches; check for possible outer-join fill */
        node->hj_JoinState = HJ_FILL_OUTER_TUPLE;
        continue;
    }

    ExecHashGetBucketAndBatch(hashtable, hashvalue,
                              &node->hj_CurBucketNo, &batchno);
    node->hj_CurSkewBucketNo = ExecHashGetSkewBucket(hashtable,
                                                     hashvalue);

I'm not sure how expensive those two functions (ExecHashGetBucketAndBatch and ExecHashGetSkewBucket) are, but this seems like a good idea. I haven't tried it, though, so maybe the HJ_FILL_OUTER_TUPLE branch still needs that info or something like that.

Makes sense?


Also, are you aware of this?

http://www.nus.edu.sg/nurop/2010/Proceedings/SoC/NUROP_Congress_Cheng%20Bin.pdf

It talks about bloom filters for hash joins in PostgreSQL
specifically. Interestingly, they talk about specific TPC-H queries.

Interesting. The way that paper uses bloom filters is very different from what I do in the patch. They build the bloom filters and then propagate them into the scan nodes to eliminate the tuples early.

The problem is that they'd be facing mostly the same problems with sizing the bloom filter that I'm facing in the patch. The only way they managed to get around them is that they used a fixed-size dataset (1GB, which is a bit small) and 1kB or 8kB bloom filters.

I suspect they'd get much worse results with datasets of "reasonable" size (say, 100GB or more), because the fixed-size bloom filters would stop being effective.
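
Just to put a number on that: an 8kB filter has 65,536 bits, so with, say, 10M distinct keys on the inner side that's roughly 150 keys per bit, and even with a single hash function the false positive rate is about (1 - exp(-150)), i.e. essentially 100% - at that point the filter rejects nothing.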


BTW, I think code like this needs better comments:

Yeah.


+static BloomFilter
+BloomFilterInit(double nrows, double error)
+{
+   /* perhaps we should round nbits to multiples of 8 ? */
+   int nbits = ceil((nrows * log(error)) / log(1.0 / (pow(2.0, log(2.0)))));
+   int nhashes = round(log(2.0) * nbits / nrows);
+
+   BloomFilter filter = palloc0(offsetof(BloomFilterData, data) +
((nbits + 7) / 8));
+
+   filter->nbits = nbits;
+   filter->nhashes = nhashes;

Have you experimentally verified that you get the expected probability
of false positives in practice?

I have seen that when the bloom filter is properly sized and the data set is not somehow skewed, the filters do have about the right false positive rate.
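
For the record, what the quoted BloomFilterInit code computes are the usual formulas for an optimally sized filter (ln being the natural logarithm):

    nbits   = -nrows * ln(error) / (ln 2)^2
    nhashes = (nbits / nrows) * ln 2

The slightly cryptic log(1.0 / pow(2.0, log(2.0))) is just -(ln 2)^2 spelled out - which might be a good starting point for the better comments.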

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

