Thank you so much, Xiaocheng, for your reply. I was actually using the k-means interface and modifying it to fit kNN's needs.
The function definition as I have been thinking about it is:

FUNCTION MADLIB_SCHEMA.knn(
    rel_train VARCHAR,
    rel_test VARCHAR,
    ignore_col_train VARCHAR,
    fn_dist VARCHAR,
    fn_dist_weighting VARCHAR,
    k INTEGER
) RETURNS knn_result

@param rel_train Name of the relation containing the training input points.

@param rel_test Name of the relation containing the test input points (these are the points whose k-nearest neighbors are to be found in rel_train).

@param ignore_col_train Name of a column to be ignored (e.g. the class column, containing labels that can be used for supervised learning elsewhere).

@param fn_dist Name of a function with signature <tt>DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that returns the distance between two points. The default is the \ref squared_dist_norm2(float8[],float8[]) "squared Euclidean distance". (Based on the k-means example: take two points, output their distance.) I would rather use the Minkowski distance and add another parameter, p, with a default of 2, so that the function is more general (p=1 gives Manhattan, p=2 gives Euclidean, etc.).

@param fn_dist_weighting Name of a function with signature <tt>DOUBLE PRECISION -> DOUBLE PRECISION</tt> that, for each distance d, returns a weighted distance value (e.g. no weighting, 1-d, or 1/d weighting schemes).

@param k Number of nearest neighbors.

@returns A composite value of type knn_result (indexes INTEGER[][], distances DOUBLE PRECISION[][]):
- <tt>indexes</tt> - matrix[n][k] of the k-nearest neighbors' indexes, where n is the size of rel_test (i.e. |rel_test|) and k is the input parameter specifying the number of nearest neighbors. For the test point with index i, the indexes of its k-nearest neighbors are in columns 0 to k-1 of row i.
- <tt>distances</tt> - matrix[n][k] of the k-nearest neighbors' distances, laid out the same way.
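To make the intended semantics concrete, here is a minimal Python sketch of the behavior the interface describes: brute-force search with a Minkowski distance (parameter p) and an optional distance-weighting function. This is illustrative only, not the SQL implementation; the names mirror the proposed parameters.

```python
def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn(train, test, k, fn_dist=minkowski, fn_dist_weighting=None):
    """For each test point, return the row indexes (into `train`) and
    distances of its k nearest training points, mirroring the proposed
    knn_result type (indexes INTEGER[][], distances DOUBLE PRECISION[][])."""
    indexes, distances = [], []
    for q in test:
        # Score every training point, then keep the k smallest distances
        # (ties broken by the lower training-row index).
        scored = sorted((fn_dist(q, t), i) for i, t in enumerate(train))[:k]
        indexes.append([i for _, i in scored])
        row = [d for d, _ in scored]
        if fn_dist_weighting is not None:
            row = [fn_dist_weighting(d) for d in row]  # e.g. 1/d or 1-d
        distances.append(row)
    return indexes, distances
```

This sorts all |rel_train| candidates per test point, which matches the naive linear scan the JIRA expects; a SQL version could instead ORDER BY distance and LIMIT k per test point.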
In other words: the row numbers of the k-nearest neighbors in rel_train, for the i-th data point of rel_test, are in indexes[i][0:k-1], and the distances to those data points are in distances[i][0:k-1].

For the implementation, I'm not quite sure how to store the results so that I can efficiently update the k-nearest points while a pass over the data is being made. Since kNN is non-iterative, I think it is possible to implement it completely in SQL, without calling external driver functions. It might also be a good idea to store the results in another table instead of returning a type like this, or to put them in a temp table and then return the result; feedback is always welcome.

I'm deliberately not using majority voting, because I do not want to limit this implementation to kNN classification. The goal is a general kNN primitive; we could then add another method that uses this implementation and runs a majority vote for classification, or takes an average for regression.

FUNCTION MADLIB_SCHEMA.__init_validate_args
(Similar to the k-means function __seeding_validate_args: k should be positive, k should be less than the number of points, fn_dist should have the correct signature, and fn_dist_weighting should have the correct signature.)

I look forward to the community's feedback as I work on this module.

Best regards,
Babak

On Mon, Apr 4, 2016 at 1:28 AM, Xiaocheng Tang <xt...@pivotal.io> wrote:

> Hi Babak,
>
> Thank you for your interest in k-NN!
> https://issues.apache.org/jira/browse/MADLIB-927
>
> The interface of new modules should be consistent with
> the existing ones in MADlib.
> In this case I would suggest
> studying the k-means module first:
> https://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html
> which in a similar way requires user input
> on k (the number of centroids) as well as a distance function,
> including Manhattan, Euclidean, or a general UDF with a specified signature.
>
> As a first step, may I suggest that you send a proposal of the function
> definitions: the parameters and return values, as well as a description of
> the functions and what they do.
>
> Based on that we can discuss the design of the interface, and once it looks
> good you can start working on the actual implementation.
> When you get to the implementation we can help you with technical challenges.
>
> Finally, check out the previous discussions about kNN in the forum:
> https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/201603.mbox/%3CCAKBQfzSQWZzUmhAEZQ3jgHLkZV0pvwucMi1Dqg6YHVEQdEbjcw%40mail.gmail.com%3E
>
> Feel free to ask questions. I look forward to collaborating with you.
>
> Best,
> Xiaocheng
>
> On Mar 28, 2016, at 3:34 PM, Babak Alipour <babak.alip...@gmail.com> wrote:
>
> > Greetings everyone,
> >
> > I am Babak Alipour, a student at the University of Florida. I have been
> > using MADlib and was hoping to use kNN classification, which is
> > unfortunately not available, so I decided to give implementing it a shot.
> >
> > Looking at the issue tracker, I found two JIRAs regarding kNN: MADLIB-409
> > and MADLIB-927.
> > The more recent JIRA mentions that a naive implementation, i.e. linearly
> > searching through the data, is expected.
> > I have a few questions regarding details the JIRA doesn't specify:
> > Generally, what is the interface of the module?
> > This question involves
> > questions such as: where is the user expected to provide k, whether to use
> > distance weighting, and the distance metric (Manhattan, Euclidean, Minkowski
> > with some p > 2)?
> > Another question is: how should the user specify the data points whose
> > k-nearest neighbors are desired? Is it some subset of the original data
> > table, or points from another data table with the same schema as the
> > original data table?
> > Also, are the output points to be kept in a separate table?
> >
> > I'd love to hear some feedback from the community so that I can move
> > forward with the implementation.
> >
> > Thanks in advance for your time.
> >
> > Best regards,
> > *Babak Alipour,*
> > *University of Florida*

-- 
*Babak Alipour,*
*University of Florida*