Re: [HACKERS] Cube extension improvement, GSoC

Alexander Korotkov Sat, 30 Mar 2013 04:56:29 -0700

Hi, Jey!

I remember I've started couple of threads related to cube extension:
http://www.postgresql.org/message-id/[email protected]
http://www.postgresql.org/message-id/[email protected]
Could you provide some feedback to GSoC proposal of Stas?


On Sat, Mar 23, 2013 at 3:10 AM, Stas Kelvich <[email protected]> wrote:

> Hello,
>
> some time ago I started working on the data search system (about 100-200M
> of records) with queries consisted of several diapason and equality
> conditions, e.g.:
>
>   WHERE dim1 BETWEEN 128 AND 137 AND
>   WHERE dim2 BETWEEN 4815 AND 162342 AND
>   WHERE dim3 = 42
>   ORDER BY dim1 ASC
>
> There are 6 or 7 search criteria in my task. In order to avoid full table
> scan I started using R-Tree over cube data type:
>
>   CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3]))
>
> For fast sorting I used Alexander Korotkov's patch for knngist (
> http://www.mail-archive.com/[email protected]/msg153676.html),
> which changes metric for nearest neighbors search and allows to obtain
> cubes ordered by a specific coordinate. Having changed some data types and
> function/operator definitions I ported this to 9.2 (
> https://github.com/kelvich/postgres/commit/bb372). Then, after this I
> could select ordered records right from the index:
>
>   SELECT * FROM my_table
>   WHERE cube(array[dim1, dim2, dim3]) <@ cube
> '(128,4815,42),(137,162342,42)'
>   ORDER BY cube(array[dim1, dim2, dim3]) <*> 5;
>
> The main issue of such approach is the index size. In my case it was about
> 100GB for 100M of records. Therefore index building becomes very expensive
> disk-related operation. For the same reason reading a large number of
> records from the index is slow too.
>
> I came to Oleg Bartunov, Theodor Sigaev and after a while to Alexander
> Korotkov for any help. (I'm very thankful to them and glad that they agreed
> to meet, listen to me and give useful advices). Having discussed it we
> decided that there was several possible improvements for the cube extension:
>
>   * Adding point data type support to the cube extension in order to avoid
> storing of coincident upper left and lower right vertices, which may reduce
> the volume that leaf nodes occupy almost twice.
>   * Checking the split algorithm with big datasets and, if possible,
> improving it.
>   * Learning cube extension to store dimensions with different data types.
> Such index would be good alternative to compound key B-Tree multi-index
> (suitable for diapason queries and data ordering).
>   * Providing support for KNN with metrics induced by the different norms:
> euclidean, taxicab norm, p-norm. This can be useful in fields where we can
> extract signature: similar images search, similar audio search.
>
> I'd like to participate in GSoC with this improvements, and I'm very
> interested in any comments or suggestions about this feature list.
>

------
With best regards,
Alexander Korotkov.

Re: [HACKERS] Cube extension improvement, GSoC

Reply via email to