This fix is orthogonal to using different hash functions for each join
level. We also need to use a different hash function for level 0 (even when
there is no spilling at all) in order to avoid hash collisions caused by
the hash partitioning operator.

On Sat, Sep 26, 2020 at 5:26 PM Mike Carey <[email protected]> wrote:

> I vaguely remember a similar performance bug years ago -- maybe that was
> w.r.t. the hash function(s) used in the series of levels when joining
> really big data.  There I think we ended up moving through a
> family/series of hash functions.  Does this fix do the same...?  (Just
> wondering if there's already other infrastructure in the code for this.)
>
> On 9/26/20 10:15 AM, Chen Luo wrote:
> > Hi team,
> >
> > Recently I found a serious performance bug for the current hash
> > join/groupby implementation [1] and submitted a fix for it [2]. The
> damage
> > caused by this bug depends on the query itself and the underlying data,
> but
> > for a specific join query TPC-H Q5 whose initial query time was very
> > strange, *the query time was reduced from ~3500s to ~1000s on 16 nodes
> > after applying this fix*.
> >
> > When we perform hash join/groupby, keys are first hash partitioned into
> > each NC partition (hash1(key)%P, where P is the number of NC partitions).
> > Within each NC partition, a hash table is built by hashing each key to a
> > slot (hash2(key)%N, where N is the number of slots). The key problem is
> > that we used the same hash function for both hash1 and hash2! In general,
> > this may lead to a lot of hash collisions. To see this problem, consider
> > what happens with NC partition 0. We know that all keys assigned to NC
> > partition 0 must satisfy hash(key) % P == 0. Now suppose we have 16 NC
> > partitions (P = 16) and N is a multiple of 4. Since hash(key)%16 == 0, we
> > know that hash(key)%N must be a multiple of 4! This means all slots that
> > are multiple of 1,2,3 will be empty, and all keys will be clustered into
> > slots that are multiples of 4. To fix this problem, we can simply use a
> > different hash function for hash join/groupby.
> >
> > If you are running experiments related to join queries and have seen some
> > unexpected performance results, it'll be helpful to try this fix to see
> > what happens.
> >
> > Best regards,
> > Chen Luo
> >
> > [1] https://issues.apache.org/jira/browse/ASTERIXDB-2783
> > [2] https://asterix-gerrit.ics.uci.edu/c/asterixdb/+/8123
> >
>

Reply via email to