Re: [HACKERS] [WIP] Effective storage of duplicates in B-tree index.

Heikki Linnakangas Mon, 04 Jul 2016 02:31:44 -0700

On 18/03/16 19:19, Anastasia Lubennikova wrote:

Please, find the new version of the patch attached. Now it has WAL
functionality.


Detailed description of the feature you can find in README draft
https://goo.gl/50O8Q0

This patch is pretty complicated, so I ask everyone, who interested in
this feature,
to help with reviewing and testing it. I will be grateful for any feedback.
But please, don't complain about code style, it is still work in progress.

Next things I'm going to do:
1. More debugging and testing. I'm going to attach in next message
couple of sql scripts for testing.
2. Fix NULLs processing
3. Add a flag into pg_index, that allows to enable/disable compression
for each particular index.
4. Recheck locking considerations. I tried to write code as less
invasive as possible, but we need to make sure that algorithm is still
correct.
5. Change BTMaxItemSize
6. Bring back microvacuum functionality.

I think we should pack the TIDs more tightly, like GIN does with the
varbyte encoding. It's tempting to commit this without it for now, and
add the compression later, but I'd like to avoid having to deal with
multiple binary-format upgrades, so let's figure out the final on-disk
format that we want, right from the beginning.

It would be nice to reuse the varbyte encoding code from GIN, but we
might not want to use that exact scheme for B-tree. Firstly, an
important criterion when we designed GIN's encoding scheme was to avoid
expanding on-disk size for any data set, which meant that a TID had to
always be encoded in 6 bytes or less. We don't have that limitation with
B-tree, because in B-tree, each item is currently stored as a separate
IndexTuple, which is much larger. So we are free to choose an encoding
scheme that's better at packing some values, at the expense of using
more bytes for other values, if we want to. Some analysis on what we
want would be nice. (It's still important that removing a TID from the
list never makes the list larger, for VACUUM.)
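
To make the idea concrete, here is a minimal standalone sketch of
delta-plus-varbyte packing for a sorted TID list. It is not GIN's code
and not the patch's: the Tid struct, the 11-bit offset width and all
function names are assumptions made up for illustration, and unlike
GIN it makes no attempt to cap a single TID at 6 bytes (the worst case
here is 7).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a heap TID: block number plus line offset. */
typedef struct
{
    uint32_t block;
    uint16_t offset;
} Tid;

/* Assumption: 11 bits are enough for the offset within a page. */
#define OFFSET_BITS 11

/* Pack a TID into one integer so that TID order equals integer order. */
static uint64_t
tid_to_u64(Tid t)
{
    assert(t.offset < (1u << OFFSET_BITS));
    return ((uint64_t) t.block << OFFSET_BITS) | t.offset;
}

static Tid
u64_to_tid(uint64_t v)
{
    Tid t;

    t.block = (uint32_t) (v >> OFFSET_BITS);
    t.offset = (uint16_t) (v & ((1u << OFFSET_BITS) - 1));
    return t;
}

/* Varbyte: 7 payload bits per byte, high bit set means "more follows". */
static size_t
varbyte_put(uint64_t val, uint8_t *out)
{
    size_t n = 0;

    while (val >= 0x80)
    {
        out[n++] = (uint8_t) (val | 0x80);
        val >>= 7;
    }
    out[n++] = (uint8_t) val;
    return n;
}

static size_t
varbyte_get(const uint8_t *in, uint64_t *val)
{
    uint64_t v = 0;
    int      shift = 0;
    size_t   n = 0;
    uint8_t  b;

    do
    {
        b = in[n++];
        v |= (uint64_t) (b & 0x7F) << shift;
        shift += 7;
    } while (b & 0x80);
    *val = v;
    return n;
}

/* Encode sorted TIDs as deltas; out needs room for 7 bytes per TID. */
static size_t
encode_tids(const Tid *tids, size_t ntids, uint8_t *out)
{
    uint64_t prev = 0;
    size_t   len = 0;

    for (size_t i = 0; i < ntids; i++)
    {
        uint64_t cur = tid_to_u64(tids[i]);

        assert(cur >= prev);    /* input must be sorted */
        len += varbyte_put(cur - prev, out + len);
        prev = cur;
    }
    return len;
}

/* Decode len bytes back into TIDs; returns the number of TIDs read. */
static size_t
decode_tids(const uint8_t *in, size_t len, Tid *tids)
{
    uint64_t prev = 0;
    size_t   pos = 0;
    size_t   n = 0;

    while (pos < len)
    {
        uint64_t delta;

        pos += varbyte_get(in + pos, &delta);
        prev += delta;
        tids[n++] = u64_to_tid(prev);
    }
    return n;
}
```

With this sketch, the four TIDs (10,1), (10,2), (10,5) and (11,1)
encode to 7 bytes in total, against 24 bytes for four raw 6-byte TIDs.
Note that finding the Nth TID requires decoding every delta before it.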

Secondly, to be able to just always enable this feature, without a GUC
or reloption, we might need something that's faster for random access
than GIN's posting lists. Or can we just add the setting, but it would
be nice to have some more analysis on the worst-case performance before
we decide on that.

I find the macros in nbtree.h in the patch quite confusing. They're
similar to what we did in GIN, but again we might want to choose
differently here. So some discussion on the desired IndexTuple layout is
in order. (One clear bug is that using the high bit of BlockNumber for
the BT_POSTING flag will fail for a table larger than 2^31 blocks.)
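
The high-bit problem can be shown in a few lines of standalone C. This
is a sketch of the failure mode only, not the patch's macros: the
POSTING_FLAG name and the two helper functions are hypothetical. The
only thing taken from PostgreSQL is that BlockNumber is an unsigned
32-bit integer.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t BlockNumber;   /* same width as PostgreSQL's BlockNumber */

/* Hypothetical flag placement: steal the top bit of the block number. */
#define POSTING_FLAG ((BlockNumber) 1u << 31)

/* True if the stored value has the flag bit set. */
static bool
looks_like_posting(BlockNumber stored)
{
    return (stored & POSTING_FLAG) != 0;
}

/* Recover the block number, assuming the top bit was only ever a flag. */
static BlockNumber
strip_flag(BlockNumber stored)
{
    return stored & ~POSTING_FLAG;
}
```

Any genuine block number of 2^31 or above already has that bit set, so
looks_like_posting() reports true for an ordinary tuple pointing there,
and strip_flag() maps block 2^31 + 5 to block 5. With the default 8 kB
block size, 2^31 blocks corresponds to a 16 TB table.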


- Heikki



--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
