Re: [HACKERS] GSOC Student Project Idea

Heikki Linnakangas Wed, 08 May 2013 01:55:26 -0700

On 24.04.2013 14:31, Florian Pflug wrote:

On Apr23, 2013, at 23:25 , Alexander Korotkov<[email protected]>
wrote:

I've taken a brief look on the paper and implementation. As I can
see iDistance implements some global building strategy. I mean, for
example, it selects some point, calculates distances from selected
point to all points in dataset etc. So, it uses the whole dataset
at the same time.


However you can try to implement global index building in GiST or
SP-GiST. In this case I think you should carefully estimate your
capabilities during single GSoC project. You would need to extend
GiST or SP-GiST interface and write completely new implementation
of tree building algorithm. Question of how to exactly extend GiST
or SP-GiST interface for this case could appear to be very hard
even theoretically.


+1. That seemed to be a major roadblock to me too when I read the
paper.

You could work around that by making partition identification a
separate step. You'd have a function

idist_analyze(cfg name, table name, field name)

which'd identify suitable partitions for the data distribution in
table.field and store them somewhere. Such a set of pre-identified
partitions would be akin to a tsearch configuration, i.e. all other
parts of the iDistance machinery would use it to map points to index
keys and queries to ranges of those keys. You'll want to look at how
tsearch handles that, and check if the method can indeed be applied
to iDistance.

You could perform that step as part of the index build. Before the indexbuild starts to add tuples to the index, it could scan a random sampleof the heap and identify the partitions based on that.

If you need to store the metadata, like a map of partitions, it becomesdifficult to cajole this into a normal GiST or SP-GiST opclass. The APIdoesn't have any support for storing such metadata.

In a first cut, you'd probably only allow inserts into index which
don't change the maximum distances from the partition centers that
idist_analyze() found.

That seems like a pretty serious restriction. I'd try to write it sothat you can insert any value, but if the new values are very differentfrom any existing values, it would be OK for the index quality todegrade. For example, you could simply add any out-of-bounds values to aseparate branch in the index, which would have no particular structureand would just have to be scanned on every query. You can probably dobetter than that, but that would be a trivial way to deal with it.


- Heikki


--
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] GSOC Student Project Idea

Reply via email to