On 10/5/21 18:28, Simon Riggs wrote:
On Tue, 5 Oct 2021 at 12:24, Dilip Kumar <dilipbal...@gmail.com> wrote:

On Tue, Oct 5, 2021 at 4:08 PM Simon Riggs <simon.ri...@enterprisedb.com> wrote:

On Mon, 27 Sept 2021 at 06:52, Amit Kapila <amit.kapil...@gmail.com> wrote:

On Thu, Sep 23, 2021 at 11:11 AM Dilip Kumar <dilipbal...@gmail.com> wrote:

On Thu, Sep 23, 2021 at 10:04 AM Sadhuprasad Patro <b.sa...@gmail.com> wrote:

And to get the multi-column hash index selected, we may set
enable_hashjoin = off, to avoid any condition becoming a join condition;
I have seen similar behavior in other DBs as well...
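
For illustration, a rough sketch of the kind of scenario being described,
with hypothetical table and column names, and assuming the proposed
multi-column hash index patch is applied:

    CREATE TABLE t1 (a int, b int);
    CREATE TABLE t2 (a int, b int);
    CREATE INDEX t1_ab_hash ON t1 USING hash (a, b);

    -- Both conditions on t1 below are join quals rather than plain
    -- restrictions, so if a hash join is chosen the multi-column hash
    -- index on t1 is never used.  Disabling hash joins nudges the
    -- planner towards a nestloop with a parameterized scan of the
    -- hash index instead.
    SET enable_hashjoin = off;
    EXPLAIN
    SELECT *
      FROM t1 JOIN t2 ON t1.a = t2.a AND t1.b = t2.b
     WHERE t2.a = 1;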

This may be related to Tom's point: if some of the quals are removed
due to optimization or converted to join quals, then even if the user
has given a qual on every key column, the index scan will not be
selected, because we would be forcing that the hash index can only be
selected if it has quals on all the key attributes?

I don't think suggesting enable_hashjoin =off is a solution,


Yeah, this doesn't sound like a good idea. How about instead exploring
the idea where the hash (bucket assignment and search) is based on the
first index key and the other columns are stored as payload? I think
this might pose some difficulty in the follow-up patch to enable a
unique index, because it will increase the chance of traversing more
buckets for uniqueness checks. If we see such problems, then I have
another idea to minimize the number of buckets that we need to lock
during the uniqueness check: lock chaining, as is used during hash
bucket cleanup, where we never need to lock more than two buckets at a
time.

I have presented a simple, almost trivial, patch to allow multi-col
hash indexes. It hashes the first column only, which can be a downside
in *some* cases. If that is clearly documented, it would not cause
many issues, IMHO. However, it does not have any optimization issues
or complexities, which is surely a very good thing.

Trying to involve *all* columns in the hash index is a secondary
optimization. It requires subtle changes in optimizer code, as Tom
points out. It also needs fine-tuning to make the all-column approach
beneficial in the additional cases without regressing the cases that
the "first column" approach already handles well.

I did consider both approaches and after this discussion I am still in
favour of committing the very simple "first column" approach to
multi-col hash indexes now.

But what about the other approach suggested by Tom: we hash only the
first column to identify the bucket, but we also store the hash values
of the other columns?  With that, we don't need changes in the
optimizer, and we can also avoid a lot of disk fetches because, after
finding the bucket, we can match the secondary columns before fetching
the heap tuple.  I agree that one downside of this approach is that it
will increase the index size.

Identifying the bucket is the main part of a hash index's work, so
that part would be identical.

Once we have identified the bucket, we sort the bucket page by hash,
so having an all-column hash would help de-duplicate multi-column hash
collisions, but not enough to be worth it, IMHO, given that storing an
extra 4 bytes per index tuple is a significant size increase which
would cause extra overflow pages, etc. The same thought applies to
8-byte hashes.


IMO it'd be nice to show some numbers to support the claims that storing the extra hashes and/or 8-byte hashes is not worth it ...

I'm sure there are cases where it'd be a net loss, but imagine for example a case when the first column has a lot of duplicate values. Which is not all that unlikely - duplicates seem like one of the natural reasons why people want multi-column hash indexes. And those duplicates are quite expensive, due to having to access the heap. Being able to eliminate those extra accesses cheaply might be a clear win, even if it makes the index a bit larger (shorter hashes might be enough).
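
To make that concrete, a hypothetical example (the names are made up, and it assumes a multi-column hash index as proposed in this thread):

    CREATE TABLE orders (status text, order_id bigint, payload text);
    -- status has only a handful of distinct values, so it is heavily
    -- duplicated; order_id is (nearly) unique
    CREATE INDEX orders_hash ON orders USING hash (status, order_id);

    SELECT * FROM orders WHERE status = 'pending' AND order_id = 12345;

With first-column-only hashing, every 'pending' row hashes to the same bucket, and each index entry found there has to be checked against the heap to evaluate the order_id condition. If a hash of order_id were stored in the index tuple as well, almost all of those entries could be skipped without touching the heap.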


IMHO, multi-column hash collisions are a secondary issue, given that
we can already select the column order for an index and hash indexes
would only be used by explicit user choice. If there are some minor
sub-optimal aspects of using hash indexes, btree is already the
default and a great choice for many cases.


But we can't select an arbitrary column order, because the first column is used to select the bucket, which means it's also about what columns are used by the queries. If a query does not use the first column, it can't use the index.
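
In other words (hypothetical names, and again assuming the proposed patch):

    CREATE INDEX t_ab_hash ON t USING hash (a, b);

    SELECT * FROM t WHERE a = 1 AND b = 2;  -- can use the index
    SELECT * FROM t WHERE b = 2;            -- cannot: the bucket is chosen
                                            -- by hashing column a

so reordering the index columns is not a free choice; the leading column has to be one the queries actually constrain.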


regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

