Hi, Jey! I remember I've started couple of threads related to cube extension: http://www.postgresql.org/message-id/4f30616d.3030...@gmail.com http://www.postgresql.org/message-id/4f3c16e9.90...@gmail.com Could you provide some feedback to GSoC proposal of Stas?
On Sat, Mar 23, 2013 at 3:10 AM, Stas Kelvich <stanc...@gmail.com> wrote: > Hello, > > some time ago I started working on the data search system (about 100-200M > of records) with queries consisted of several diapason and equality > conditions, e.g.: > > WHERE dim1 BETWEEN 128 AND 137 AND > WHERE dim2 BETWEEN 4815 AND 162342 AND > WHERE dim3 = 42 > ORDER BY dim1 ASC > > There are 6 or 7 search criteria in my task. In order to avoid full table > scan I started using R-Tree over cube data type: > > CREATE INDEX ON my_table USING GiST(cube(array[dim1, dim2, dim3])) > > For fast sorting I used Alexander Korotkov's patch for knngist ( > http://www.mail-archive.com/pgsql-hackers@postgresql.org/msg153676.html), > which changes metric for nearest neighbors search and allows to obtain > cubes ordered by a specific coordinate. Having changed some data types and > function/operator definitions I ported this to 9.2 ( > https://github.com/kelvich/postgres/commit/bb372). Then, after this I > could select ordered records right from the index: > > SELECT * FROM my_table > WHERE cube(array[dim1, dim2, dim3]) <@ cube > '(128,4815,42),(137,162342,42)' > ORDER BY cube(array[dim1, dim2, dim3]) <*> 5; > > The main issue of such approach is the index size. In my case it was about > 100GB for 100M of records. Therefore index building becomes very expensive > disk-related operation. For the same reason reading a large number of > records from the index is slow too. > > I came to Oleg Bartunov, Theodor Sigaev and after a while to Alexander > Korotkov for any help. (I'm very thankful to them and glad that they agreed > to meet, listen to me and give useful advices). Having discussed it we > decided that there was several possible improvements for the cube extension: > > * Adding point data type support to the cube extension in order to avoid > storing of coincident upper left and lower right vertices, which may reduce > the volume that leaf nodes occupy almost twice. > * Checking the split algorithm with big datasets and, if possible, > improving it. > * Learning cube extension to store dimensions with different data types. > Such index would be good alternative to compound key B-Tree multi-index > (suitable for diapason queries and data ordering). > * Providing support for KNN with metrics induced by the different norms: > euclidean, taxicab norm, p-norm. This can be useful in fields where we can > extract signature: similar images search, similar audio search. > > I'd like to participate in GSoC with this improvements, and I'm very > interested in any comments or suggestions about this feature list. > ------ With best regards, Alexander Korotkov.