[HACKERS] Designing an extension for feature-space similarity search

Jay Levitt Wed, 15 Feb 2012 12:35:30 -0800

[Preamble: I've been told that the hackers list is appropriate for extension-related topics like this, even if it's not about contributing to core. If I'm misappropriating, please let me know.]


Goal: Personalized, context-relevant query results

We are building a deeply personalized site; think "OKCupid for product recommendations" or "Pinterest for people with your tastes". We use psych research to measure and predict your personality and traits along a number of scales (dimensions), and then we connect you with people, products and content we think you'll like.


I won't go into the design history, but you can read a little here:

http://parapoetica.wordpress.com/2012/02/15/feature-space-similarity-search-in-postgresql/

Suffice to say, this ends up needing something like KNN-GiST cubes, only:

- The overall concept is more like N-dimensional vectors than cubes
- But a dimension might be in any domain, not just floats
- All vectors have the same number of dimensions with the same meanings
- The distance along each dimension is a domain-specific function
- NULLs are allowed (the distance function will handle the semantics)

- The distance between two vectors is a function that aggregates the distances of each dimension, along with arbitrary other arguments - for instance, it might take the weighted average of the dimensions
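
To make the requirements above concrete, here is a minimal Python sketch of the semantics only (the real thing would have to be C inside an extension). Every name in it is hypothetical, and the NULL cost of 1.0 and the weighted-average aggregation are just the examples used in this post, not a settled design.

```python
from typing import Callable, Optional, Sequence

# A per-dimension distance takes two possibly-NULL (None) values.
DimDistance = Callable[[Optional[object], Optional[object]], float]

NULL_DISTANCE = 1.0  # assumption: cost charged when either side is NULL


def numeric_distance(a: Optional[float], b: Optional[float]) -> float:
    """Distance along a float-valued dimension; NULLs get a fixed cost."""
    if a is None or b is None:
        return NULL_DISTANCE
    return abs(a - b)


def category_distance(a: Optional[str], b: Optional[str]) -> float:
    """Distance along a categorical dimension: 0 if equal, 1 otherwise."""
    if a is None or b is None:
        return NULL_DISTANCE
    return 0.0 if a == b else 1.0


def vector_distance(
    u: Sequence[object],
    v: Sequence[object],
    dim_distances: Sequence[DimDistance],
    weights: Sequence[float],
) -> float:
    """Weighted average of the per-dimension distances."""
    if not (len(u) == len(v) == len(dim_distances) == len(weights)):
        raise ValueError("all vectors must have the same number of dimensions")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted = sum(
        w * d(a, b) for a, b, d, w in zip(u, v, dim_distances, weights)
    )
    return weighted / total_weight


# Hypothetical feature vectors: (openness score, temperament, thrill-seeking)
alice = (0.2, "introvert", None)
bob = (0.6, "introvert", 0.9)
dims = [numeric_distance, category_distance, numeric_distance]
weights = [2.0, 1.0, 1.0]
print(vector_distance(alice, bob, dims, weights))
```

The weights stand in for the "arbitrary other arguments" mentioned above; a different aggregation would replace only vector_distance.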

That aggregation (which may not literally be an aggregate; I'm not sure yet) needs to happen in a SELECT list, which means it needs to be fast, which means all this (or at least much of it) has to be C.

The "simplest thing that works" is probably to hack up the cube extension, declare that everything (except inner pages) must be a zero-volume cube (cube_is_point()), map our non-float features onto floats somehow, and hard-code all the distance functions and the aggregation function.
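
As an illustration of the "map our non-float features onto floats somehow" step, here is a rough Python sketch. The feature names and the ordinal encoding are invented for the example, and it does not address NULLs at all. The output string uses the parenthesised, comma-separated form that the cube type accepts as text input for a point.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical ordinal encoding for one non-float feature.
EDUCATION_LEVELS: Dict[str, float] = {
    "none": 0.0,
    "high_school": 1.0,
    "college": 2.0,
    "graduate": 3.0,
}


def encode_features(
    openness: float, education: str, likes_hiking: bool
) -> Tuple[float, float, float]:
    """Map a mixed-type feature row onto plain floats, one per dimension."""
    return (
        float(openness),
        EDUCATION_LEVELS[education],
        1.0 if likes_hiking else 0.0,
    )


def to_cube_point_literal(coords: Sequence[float]) -> str:
    """Render coordinates as a text literal for a zero-volume cube (a point)."""
    return "(" + ", ".join(repr(float(c)) for c in coords) + ")"


print(to_cube_point_literal(encode_features(0.7, "college", True)))
```

The drawback is visible even in this toy: the encoding table lives in code, so adding or changing a dimension means changing code rather than issuing DDL.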

But I think this sort of similarity-search engine has general utility, and I also want to make it easy for us to add and subtract dimensions without too much pain; that should be DDL, not code. So thinking about how this might evolve...

- I'm not sure how to represent arbitrary column-like features without reinventing the wheel and putting a database in the database. hstore only stores text, probably for this reason; I took a look at the earlier json patch and saw that it handled only a few core data types. Have there been any other PoCs that involved collections of heterogeneous data? I almost want an actual instance of an "anyarray".

- Alternatively, is there a way to index an entire, arbitrary row, rather than a column of that row? I'm fine with this extension requiring its own table, so I leave the data where it is in the row, and only worry about indexing it. I can't just use functional indexes, because I'll need to provide operators and support functions to GiST. Maybe I have a fake sentinel column, where all the operators use SPI to introspect the row, treat each column as a feature dimension, call the underlying operators on each column's data type, etc.


- Can domains have operators, or are operators defined on types?

- Does KNN-GiST run into problems when <-> returns values that don't "make sense" in the physical world? For instance, let's say NULL <-> NULL returns a distance of 1.0. That means that NULL1 <-> NULL2 = 1.0, and NULL2 <-> NULL3 = 1.0, but NULL1 <-> NULL3 = 1.0 as well. I think that's fine - that could even describe a triangle - but my spidey sense is tingling on this.

- Are there previous discussions, patches, abandoned projects, etc. that this reminds you of and that I should go research?
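
On the NULL-distance question above, one way to poke at the "spidey sense" is to brute-force the usual metric properties (zero self-distance, symmetry, triangle inequality) over a handful of sample values. The Python sketch below does that. It assumes, beyond what is stated above, that a NULL compared with a non-NULL value also costs 1.0; whether KNN-GiST actually depends on any of these properties is exactly the open question.

```python
from itertools import product
from typing import Callable, List, Optional, Sequence, Tuple


def null_aware_distance(a: Optional[float], b: Optional[float]) -> float:
    """Assumed rule: any comparison involving a NULL costs 1.0."""
    if a is None or b is None:
        return 1.0
    return abs(a - b)


def metric_violations(
    points: Sequence[Optional[float]],
    dist: Callable[[Optional[float], Optional[float]], float],
    tol: float = 1e-12,
) -> List[Tuple]:
    """Return every sampled violation of the basic metric properties."""
    violations: List[Tuple] = []
    for a in points:
        if abs(dist(a, a)) > tol:  # a point should be at distance 0 from itself
            violations.append(("identity", a))
    for a, b in product(points, repeat=2):
        if abs(dist(a, b) - dist(b, a)) > tol:  # d(a, b) should equal d(b, a)
            violations.append(("symmetry", a, b))
    for a, b, c in product(points, repeat=3):
        if dist(a, c) > dist(a, b) + dist(b, c) + tol:  # detour via b is shorter
            violations.append(("triangle", a, b, c))
    return violations


# Three NULLs only, as in the example above.
print(metric_violations([None, None, None], null_aware_distance))

# NULL mixed with ordinary values that lie more than 2.0 apart.
print(metric_violations([None, 0.0, 0.5, 3.0], null_aware_distance))
```

With three NULLs alone, the triangle inequality holds and the only thing flagged is that a NULL is at distance 1.0 from itself. Once non-NULL values can be more than 2.0 apart, going "through" a NULL becomes a shortcut (0.0 to 3.0 costs 3.0 directly but 2.0 via NULL), so the triangle inequality fails under this assumed rule.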

Thanks for any thoughts, and I'd love collaborators or even mentors - we plan to open source whatever we produce here, and I don't have quite the theoretical background it takes to do this properly.


Jay Levitt
