Re: [HACKERS] [GENERAL] Understanding and implementing a GiST Index

Connor Wolf Fri, 10 Oct 2014 10:50:14 -0700

I'd be ok with either SP-GiST or GiST, though there are tree structures(BK tree) that are particularly suited to this application that I/think/ map to GiST better then SP-GiST.

Re: hamming in the RD-tree implementation: Where? I cloned the smlar gitrepo, and poked around, and it looks like it only can operate on setintersections, which is a non-starter for what I need to do. I spent awhile looking through the source, and I didn't see anything that lookedlike hamming distance calculation (through I probably missed some stuff.The indirection through `FCall2()` makes things very hard to follow).

Anyways, right now, smlar is not useful to me, because it completelyignores the structure of incoming arrays:


postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# );
 smlar
-------
     1
(1 row)

For two radically different hashes (shortened for example purposes)which compare as identical.

I spent some time trying to see if I could easily disable the arrayuniquefication, and by disabling the calls to `qsort_arg`, and modifying`numOfIntersect` so it just counts the number of times the arrayelements do not match, and I'm getting the behaviour I want out of`smlar()` at this point:


postgres=# SELECT smlar(
postgres(# '{1,0,0,0, 1,1,1,1, 1,1,1,0, 0,1,0,0}'::int[],
postgres(# '{1,0,0,1, 0,1,0,0, 0,1,0,0, 0,1,1,0}'
postgres(# );
 smlar
--------
 0.4375
(1 row)

postgres=# SELECT smlar(
'{0,1,1,0}'::int[],
'{1,0,1,0}'
);
 smlar
-------
   0.5
(1 row)

But the index does not seem to work after the changes I made, and thedebugging print statements (well, `elog()` statements) I inserted into`cmpArrayElem()` and `numOfIntersect()` are not being hit, so I don'tunderstand how the index is even being built.


Anyways, I'm rambling a bit. Any input would be great.
Thanks,
Connor

On 10/9/2014 8:11 AM, Oleg Bartunov wrote:

Just quick recommendation.

You need to ask -hackers mailing list for that. Also, do you want GiSTor SP-GiST ?We already use hamming distance in rd-tree, which implemented in GiST,see our presentationhttp://www.sai.msu.su/~megera/postgres/talks/pgcon-2012.pdf<http://www.sai.msu.su/%7Emegera/postgres/talks/pgcon-2012.pdf>, forexample.


Oleg

On Thu, Oct 9, 2014 at 11:09 AM, Connor Wolf<[email protected] <mailto:[email protected]>>wrote:


    Hi there!

    I'm trying to implement a custom indexing system using the GiST
    index framework, and I have a few questions.
    Basically, I'm trying to implement a system that allows me to
    search across a set of 64 bit integers by hamming distance. This
    is intended to be used in searching for similar images, where the
    64 bit integer is actually a phash
    <http://www.hackerfactor.com/blog/?/archives/432-Looks-Like-It.html>
    of an image, and similarity between two images is reflected in the
    hamming distance between two integers.

    Anyways, The appropriate approach here is to use something called
    a BK tree
    <http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees>,
    for which I've written some test code
    
<https://github.com/fake-name/MangaCMS/blob/master/deduplicator/hamDb.py#L27>
    and I think I have a decent grip of (my test code seems to work,
    in any event).

    That said:

      * Is there a /simple/ piece of example-code for implementing a
        GiST index for a single index? I've been looking through the
        /contrib/ directory, particularly the /contrib/btree_gist/
        files, but it looks like 90% of the complexity there is
        related to supporting all the column types Postgres has,
        rather then actually tied to producing a functional index.
      * Once I have something compiling, how can I check to be sure
        that I'm actually using the indexing module I created? I can
        enable my compiled extension using `CREATE EXTENSION`, but is
        there an option for `CREATE INDEX test_index ON testing USING
        gist (val);` that lets me specify *which* GiST index is
        actually employed? Is this even a valid question?
        This is probably something that's available in one of the
        system tables somewhere, but I haven't had much luck with
        google finding out where.
      * Testing: What's the appropriate way to examine the generated
        tree structure of the index? I certainly went through a few
        bugs with my test tree system (and that was in python!). Is
        there any way to examine the index structure for debugging
        purposes?
      * Also, it looks like `ereport()` is the proper way to emit
        debugging information. Is this correct?
      * In that vein, is there any way to have information that is on
        the operations of an entire query? Looking at the number of
        tree nodes touched for a scan would be nice (and I would not
        be surprised if there is already a facility for it).

    Project code is here
    <https://github.com/fake-name/pg-spgist_hamming> if anyone is
    interested, any help would be great. I have very little idea what
    I'm doing.

    Thanks,
    Connor

Re: [HACKERS] [GENERAL] Understanding and implementing a GiST Index

Reply via email to