# [HACKERS] Designing an extension for feature-space similarity search

[Preamble: I've been told that the hackers list is appropriate for extension-related topics like this, even if it's not about contributing to core. If I'm misappropriating, please let me know.]
```
Goal: Personalized, context-relevant query results```
```
```
We are building a deeply personalized site; think "OKCupid for product recommendations" or "Pinterest for people with your tastes". We use psych research to measure and predict your personality and traits along a number of scales (dimensions), and then we connect you with people, products and content we think you'll like.
```
I won't go into the design history, but you can read a little here:

http://parapoetica.wordpress.com/2012/02/15/feature-space-similarity-search-in-postgresql/

Suffice to say, this ends up needing something like KNN-GiST cubes, only:

- The overall concept is more like N-dimensional vectors than cubes
- But a dimension might be in any domain, not just floats
- All vectors have the same number of dimensions with the same meanings
- The distance along each dimension is a domain-specific function
- NULLs are allowed (the distance function will handle the semantics)
```
- The distance between two vectors is a function that aggregates the distances of each dimension, along with arbitrary other arguments - for instances, it might take the weighted average of the dimensions
```
```
That aggregation (which may not literally be an aggregate; I'm not sure yet) needs to happen in a SELECT list, which means it needs to be fast, which means all this (or at least much of it) has to be C.
```
```
The "simplest thing that works" is probably to hack up the cube extension, declare that everything (except inner pages) must be a zero-volume cube (cube_is_point()), map our non-float features onto floats somehow, and hard-code all the distance functions and the aggregation function.
```
```
But I think this sort of similarity-search engine has general utility, and I also want to make it easy for us to add and subtract dimensions without too much pain; that should be DDL, not code. So thinking about how this might evolve...
```
```
- I'm not sure how to represent arbitrary column-like features without reinventing the wheel and putting a database in the database. hstore only stores text, probably for this reason; I took a look at the earlier json patch and saw that it handled only a few core data types. Have there been any other PoCs that involved collections of hetereogenous data? I almost want an actual instance of an "anyarray".
```
```
- Alternatively, is there a way to index an entire, arbitrary row, rather than on a column on that row? I'm fine with this extension requiring its own table, so I leave the data where it is in the row, and only worry about indexing it. I can't just use functional indexes, because I'll need to provide operators and support functions to GiST. Maybe I have a fake sentinel column, where all the operators use SPI to introspect the row, treat each column as a feature dimension, call the underlying operators on each column's data type, etc.
```
- Can domains have operators, or are operators defined on types?

```
- Does KNN-GiST run into problems when <-> returns values that don't "make sense" in the physical world? For instance, let's say NULL <-> NULL returns a distance of 1.0. That means that NULL1 <-> NULL2 = 1.0, and NULL2 <-> NULL3 = 1.0, but NULL1 <-> NULL3 = 1.0 as well. I think that's fine - that could even describe a triangle - but my spidey sense is tingling on this.
```
```
- Are there previous discussions, patches, abandoned projects, etc. that this reminds you of and that I should go research?
```
```
Thanks for any thoughts, and I'd love collaborators or even mentors - we plan to open source whatever we produce here, and I don't have quite the theoretical background it takes to do this properly.
```
Jay Levitt

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
```