Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/15511 )
Change subject: WIP IMPALA-9434: Implement Robin Hood Hash Table. ...................................................................... Patch Set 3: (6 comments) http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@12 PS3, Line 12: robin hood algorithm to keep probe function of hash-table intact. It's worth noting that this isn't the full robin-hood hashing approach, because we're not short-circuiting the lookup based on the invariant (i.e. once the max distance of a hash value you'd seen on the probe exceeds your current distance, you know the key cannot be in the hash table). I think this may be a problem if you do benchmarks where most of the lookups are not present in the hash table. This might be significant for some workloads. But I guess for hash joins, this will give most of the benefit, if we assume that missing values are mostly filtered out by bloom filters in the scan. So if we see a big impact, we can look at optimising it further. http://gerrit.cloudera.org:8080/#/c/15511/3//COMMIT_MSG@14 PS3, Line 14: Instead of proactively swapping elements during insertion, the I guess we can revisit this to see what the impact is, it seems like it probably means that insert is somewhat more expensive because of the two passes. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h File be/src/exec/hash-table.inline.h: http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@115 PS3, Line 115: target_distance == 0 || Why do we need this condition? My understanding was that you always want to maintain the invariant that the richer key gets bumped. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@138 PS3, Line 138: if (curr_bucket->filled) { Is it possible for this to be false? Why would there be an empty bucket *before* the current position of the key (given that we don't support deletion from the hash table). I.e. Maybe this should be a DCHECK to enforce the invariant. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@162 PS3, Line 162: inline int64_t HashTable::BucketDistance( I guess calculating this for linear probing is relatively straightforward, but not sure if there's a good way to calculate it for quadratic probing (quadratic probing + robin hood hashing might not be a great combination). We did the quadratic probing because of a tendency for the CRC-based hash to cluster. I don't know if robin-hood hashing is enough to counter-act that tendency. http://gerrit.cloudera.org:8080/#/c/15511/3/be/src/exec/hash-table.inline.h@374 PS3, Line 374: table_->PrepareBucketForInsert(bucket_idx_, hash); You could also rebalance there, I think, to benefit the agg. Or combine the rebalancing with PrepareBucketForInsert -- To view, visit http://gerrit.cloudera.org:8080/15511 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28eeccd7f9ccae39e31972391f971901bcbfe986 Gerrit-Change-Number: 15511 Gerrit-PatchSet: 3 Gerrit-Owner: Riza Suminto <[email protected]> Gerrit-Reviewer: David Rorke <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Riza Suminto <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Comment-Date: Tue, 24 Mar 2020 16:23:27 +0000 Gerrit-HasComments: Yes
