Ok, more questions. I've been studying the implementation Alexander Korotkov sent, and I'm not seeing how to map some of the components onto the changes in the SP-GiST system that occured between Postgresql 9.2 and 9.3.
The changes at that point seem to have been to change xxx_inner_consistent and xxx_leaf_consistent from taking an input containing a Datum pointing to the query conditional to a list of ScanKeys which each encapsulate one filter condition. The issue here is that for VP trees (or the BK tree I want to eventually implement), the filter condition requires two parameters: - The maximum allowed distance from the target value - The actual target value. The ScanKeys struct appears to only be able to contain a conditional type, and a single parameter, such as "less then *x*", "above *y*", and so forth. Looking at the existing spgkdtreeproc.c and spgquadtreeproc.c, their mechanism for passing more complex conditions through to the xxx_consistent functions appears to be to encapsulate the entire condition in a custom type. For example, their mechanism for querying if something is within a certain envelope is done by having the envelope be described by the "BOX" type. As such: Will compound queries as I describe above basically require a custom type to make it possible? My (admittedly naive) expectation is that the eventual query for this index will look something like "SELECT * FROM example_table WHERE indexed_column <=> target_value < 4;", with "<=>" being the operator for the relevant distance calculation (hamming, for the BK tree, numeric for the VP-tree). The existing VP-tree code appears to not support multiple operators whatsoever, probably because it was very preliminary. Thanks! Connor On Mon, Oct 30, 2017 at 7:04 PM, Connor Wolf < conn...@imaginaryindustries.com> wrote: > I was mostly unclear on how I'd go about attaching the extension functions > to the relevant indexing mechanism. From the looks of the vptree.tar.gz > file (which is really, *really* helpful, incidentally!), a it's done via a > custom operator class, which then gets passed to the actual index creation > mechanism when you're declaring the index. > > I think I had looked at that at one point, but it's been a while. In my > case, I'm using discrete-cosine-transform based perceptual hashes for > searching. They are nice and compact (64-bits per hash), while still > producing good search results. I have a dataset of ~36 million images, and > it does searches in < 50 milliseconds with a hamming distance of 4, while > touching ~0.25% of the tree (And occupying ~18 GB of ram). > > My BK tree is up on github here > <https://github.com/fake-name/IntraArchiveDeduplicator/blob/master/deduplicator/bktree.hpp>, > if anyone needs something like that (BSD licensed, pretty well tested). > There's also a python wrapper for it. > > I'll probably not have time to poke about until this weekend, but thanks! > Connor > > > > On Mon, Oct 30, 2017 at 4:50 AM, Alexander Korotkov < > a.korot...@postgrespro.ru> wrote: > >> Hi! >> >> On Sun, Oct 29, 2017 at 12:07 PM, Connor Wolf < >> w...@imaginaryindustries.com> wrote: >> >>> I'm looking at implementing a custom indexing scheme, and I've been >>> having trouble understanding the proper approach. >>> >>> Basically, I need a BK tree, which is a tree-structure useful for >>> indexing arbitrary discrete metric-spaces (in my case, I'm interested in >>> indexing across the hamming edit-distance of perceptual hashes, for fuzzy >>> image searching). I'm *pretty* sure a SP-GiST index is the correct >>> index type, as my tree is intrinsically unbalanced. >>> >> >> Yes, SP-GiST is appropriate index type for BK tree. I'm pretty sure BK >> tree could be implemented as SP-GiST opclass. >> The only thing worrying me is selection pivot values for nodes. SP-GiST >> builds by insertion of index tuples on by one. First pivot value for root >> node in SP-GIST would be created once first leaf page overflows. Thus, you >> would have to select this pivot value basing on very small fraction in the >> beginning of dataset. >> As I know, BK tree is most efficient when root pivot value is selected >> after looking in whole dataset and then hierarchically to subtrees. >> >> BTW, did you try my extension for searching similar images. It's quite >> primitive, but works for some cases. >> https://github.com/postgrespro/imgsmlr >> https://wiki.postgresql.org/images/4/43/Pgcon_2013_similar_images.pdf >> >> I have a functional stand-alone implementation of a BK-Tree, and it works >>> very well, but the complexity of managing what is basically a external >>> index for my database has reached the point where it's significantly >>> problematic, and it seems to be it should be moved into the database. >>> >> >> Sure, moving this index to the database is right decision. >> >> Anyways, looking at the contents of postgres/src/backend/access/spgist, >>> it looks pretty straightforward in terms of the actual C implementation, >>> but I'm stuck understanding how to "install" a custom SP-GiST >>> implementation. There are several GiST indexing implementations in the >>> contrib directory, but no examples for how I'd go about implementing a >>> loadable SP-GiST index. >>> >>> Basically, my questions are: >>> >>> - Is it possible to implement a SP-GiST indexing scheme as a >>> loadable module? >>> >>> Yes, it's possible to define SP-GiST. >> >>> >>> - If so, how? >>> >>> The pretty same way as GiST opclass extension. You have to define >> supporting functions and operators and then define operator class over them. >> >>> >>> - And is there an example I can base my implementation off of? >>> >>> >>> >>> I'm relatively comfortable with C (much moreso with C++), but I haven't >>> spent a lot of time looking at the postgresql codebase. I don't think I >>> could start from a empty folder and make a properly-implemented module in >>> any reasonable period of time, so if I have a working example for some sort >>> of index that uses the same interfaces that would really help a lot >>> >> I don't think there is an example in PostgreSQL source code tree or on >> github. But I've attached by early experiment with VP-tree (seems to be >> pretty same as BK tree) using SP-GiST (see vptree.tar.gz). Basing on this >> experiment I realized that it's important to select root pivot value basing >> on the whole dataset. However, for your metric/dataset/queries it might >> appear to be different. >> >> It also would be nice to someday improve SP-GiST to support some global >> strategies on index creation. In particular, it would allow to resolve >> selection of pivot values problem that I mention above. Right now my >> colleagues and me don't have time for that. But I can assist you with >> advises if you will decide to implement that. >> >> ------ >> Alexander Korotkov >> Postgres Professional: http://www.postgrespro.com >> The Russian Postgres Company >> >> >> -- >> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) >> To make changes to your subscription: >> http://www.postgresql.org/mailpref/pgsql-hackers >> >> >