Thanks Babak. We will review the interface and get back to you soon. In the meantime, you may go ahead with the implementation; I would suggest that you describe your approach in a design doc and share it here before actually coding it up. The design doc should include the necessary math formulations and relevant references, if there are any.
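[Editor's note: as an illustration of the kind of math formulation such a design doc might include, here is a minimal Python sketch of the generalized Minkowski distance and the distance-weighting schemes proposed in the quoted thread below. Function names are hypothetical, not MADlib APIs.]

```python
def minkowski_dist(x, y, p=2.0):
    """Minkowski distance between two points:
    p=1 gives Manhattan, p=2 gives Euclidean (the proposed default)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# The three weighting schemes mentioned in the thread (none, 1-d, 1/d):
def weight_none(d):
    return d

def weight_one_minus(d):
    return 1.0 - d

def weight_inverse(d):
    # Guard against division by zero for coincident points.
    return 1.0 / d if d > 0 else float("inf")
```

With this parameterization, a single `p` argument (default 2) covers the Manhattan and Euclidean cases without separate distance functions.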
> On Apr 8, 2016, at 8:19 AM, Babak Alipour <[email protected]> wrote:
>
> Thank you so much Xiaocheng for your reply.
>
> I was actually using the k-means interface and modifying it to fit kNN
> needs.
>
> The function definition as I have been thinking about it is:
>
> FUNCTION MADLIB_SCHEMA.knn(
>     rel_train VARCHAR,
>     rel_test VARCHAR,
>     ignore_col_train VARCHAR,
>     fn_dist VARCHAR,
>     fn_dist_weighting VARCHAR,
>     k INTEGER
> ) RETURNS knn_result
>
> @param rel_train Name of the relation containing the training input points
> @param rel_test Name of the relation containing the testing input points
>     (these are the points whose k-nearest neighbors are to be found in
>     rel_train)
> @param ignore_col_train Name of a column to be ignored (e.g. the class
>     column, which can be used for supervised learning elsewhere)
> @param fn_dist Name of a function with signature <tt>DOUBLE PRECISION[] x
>     DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that returns the distance
>     between two points. The default is the \ref
>     squared_dist_norm2(float8[],float8[]) "squared Euclidean distance".
>     (Based on the k-means example: take two points, output their distance.)
>     I'd rather use the Minkowski distance and add another parameter, p,
>     with a default of 2, so that the function is more general (p=1 gives
>     Manhattan, p=2 gives Euclidean, etc.)
> @param fn_dist_weighting Name of a function with signature <tt>DOUBLE
>     PRECISION -> DOUBLE PRECISION</tt> that, for each distance, returns a
>     weighted distance value (e.g. none, 1-d, and 1/d weighting schemes)
> @param k Number of nearest neighbors
>
> @returns a composite value of type knn_result:
>     indexes INTEGER[][],
>     distances DOUBLE PRECISION[][]
> - <tt>indexes</tt> - matrix[n][k] of the k nearest neighbors' indexes
>   (n is the size of rel_test, i.e. |rel_test|; k is the input parameter
>   specifying the number of nearest neighbors). Row i holds the indexes of
>   the k nearest neighbors of test point i, one per column 0 to k-1.
> - <tt>distances</tt> - matrix[n][k] of the k nearest neighbors' distances
> In other words, the row numbers in rel_train of the k nearest neighbors of
> the i-th data point of rel_test are in indexes[i][0:k-1], and the distances
> to those data points are in distances[i][0:k-1].
>
> For the implementation, I'm not quite sure how to store the results so
> that I can efficiently update the k nearest points as a pass over the data
> is made. Since kNN is non-iterative, I think it is possible to implement
> it entirely in SQL without calling external driver functions. It might
> also be a good idea to store the results in another table instead of
> returning a type like this, or to put the results in a temp table and then
> return them; feedback is always welcome.
> I'm deliberately not using majority voting, because I do not want to limit
> this implementation to kNN classification. The idea is a general kNN
> algorithm on top of which we could later build another method that runs a
> majority vote for classification, or takes an average for regression.
>
> FUNCTION MADLIB_SCHEMA.__init_validate_args
>     (Similar to the k-means function __seeding_validate_args: k should be
>     positive, k should be less than the number of points, fn_dist should
>     have the correct signature, and fn_dist_weighting should have the
>     correct signature.)
>
> I look forward to the community's feedback as I work on this module.
>
> Best regards,
> Babak
>
> On Mon, Apr 4, 2016 at 1:28 AM, Xiaocheng Tang <[email protected]> wrote:
>
>> Hi Babak,
>>
>> Thank you for your interest in k-NN!
>> https://issues.apache.org/jira/browse/MADLIB-927
>>
>> The interface of new modules should be consistent with
>> the existing ones in MADlib. In this case I would suggest
>> studying K-means first:
>> https://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html
>> which in a similar way requires user input on k (the number of centroids)
>> as well as a distance function, including Manhattan, Euclidean, or a
>> general UDF with a specified signature.
>>
>> As a first step, may I suggest that you send a proposal of the function
>> definitions: the parameters and return values, as well as a description
>> of the functions and what they do.
>>
>> Based on that we can discuss the design of the interface, and once it
>> looks good you can start working on the actual implementation. When you
>> get to the implementation we can help you with technical challenges.
>>
>> Finally, check out the previous discussion about kNN in the forum:
>> https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/201603.mbox/%3CCAKBQfzSQWZzUmhAEZQ3jgHLkZV0pvwucMi1Dqg6YHVEQdEbjcw%40mail.gmail.com%3E
>>
>> Feel free to ask questions. Looking forward to collaborating with you.
>>
>> Best,
>> Xiaocheng
>>
>>> On Mar 28, 2016, at 3:34 PM, Babak Alipour <[email protected]> wrote:
>>>
>>> Greetings everyone,
>>>
>>> I am Babak Alipour, a student at the University of Florida. I have been
>>> using MADlib and was hoping to use kNN classification, which is
>>> unfortunately not available, so I decided to give implementing it a shot.
>>>
>>> Looking at the issue tracker, I found two JIRAs regarding kNN:
>>> MADLIB-409 and MADLIB-927.
>>> The more recent JIRA mentions that a naive implementation, i.e. linearly
>>> searching through the data, is expected.
>>> I have a few questions regarding details the JIRA doesn't specify.
>>> What is the interface of the module? For example: where is the user
>>> expected to provide k, whether to use distance weighting, and the
>>> distance metric (Manhattan, Euclidean, Minkowski with some p > 2)?
>>> Another question: how should the user specify the data points whose
>>> k-nearest neighbors are desired? Is it some subset of the original data
>>> table, or points from another table with the same schema as the original
>>> data table?
>>> Also, are the output points to be kept in a separate table?
>>>
>>> I'd love to hear some feedback from the community so that I can move
>>> forward with the implementation.
>>>
>>> Thanks in advance for your time.
>>>
>>> Best regards,
>>> Babak Alipour,
>>> University of Florida
>
> --
> Babak Alipour,
> University of Florida
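[Editor's note: a minimal in-memory sketch of the semantics proposed in the thread above, in hypothetical Python rather than the eventual SQL implementation. It does the naive linear scan the JIRA expects, produces the indexes[n][k] and distances[n][k] matrices Babak describes, and layers majority-vote classification on top of the general kNN routine, as he suggests. All names here are illustrative, not MADlib APIs.]

```python
from collections import Counter

def minkowski(x, y, p=2.0):
    """Minkowski distance; p=2 is the Euclidean default."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def knn(rel_train, rel_test, k, fn_dist=minkowski):
    """Brute-force kNN: for each test point, linearly scan the training
    points and keep the k closest. Returns (indexes, distances), each an
    n-by-k matrix where n = len(rel_test)."""
    indexes, distances = [], []
    for q in rel_test:
        scored = sorted(
            ((fn_dist(q, t), i) for i, t in enumerate(rel_train)),
            key=lambda pair: pair[0],
        )[:k]
        distances.append([d for d, _ in scored])
        indexes.append([i for _, i in scored])
    return indexes, distances

def knn_classify(rel_train, labels, rel_test, k):
    """Majority vote over the k nearest neighbors' labels: the separate
    classification method proposed on top of the general kNN routine."""
    idx, _ = knn(rel_train, rel_test, k)
    return [Counter(labels[i] for i in row).most_common(1)[0][0]
            for row in idx]
```

An average of the neighbors' labels in place of the `Counter` vote would give the regression variant mentioned in the thread; the general `knn` routine is unchanged in either case.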
