Hi Babak, I noticed the KNN poster http://dsr.cise.ufl.edu/wp-content/uploads/2016/05/MADlib_Combined.pptx.pdf and was wondering if you have plans to make a pull request?
Frank

On Fri, Apr 8, 2016 at 11:35 AM, Xiaocheng Tang <[email protected]> wrote:
> Thanks Babak. We will review the interface and get back to you soon.
>
> Meanwhile, you may go ahead with the implementation, for which I would
> suggest that you describe your approach in a design doc and share it here
> first, before actually coding it up. The design doc should include the
> necessary math formulations and relevant references, if there are any.
>
> > On Apr 8, 2016, at 8:19 AM, Babak Alipour <[email protected]> wrote:
> >
> > Thank you so much, Xiaocheng, for your reply.
> >
> > I was actually using the k-means interface and modifying it to fit
> > kNN's needs.
> >
> > The function definition as I have been thinking about it is:
> >
> >     FUNCTION MADLIB_SCHEMA.knn(
> >         rel_train VARCHAR,
> >         rel_test VARCHAR,
> >         ignore_col_train VARCHAR,
> >         fn_dist VARCHAR,
> >         fn_dist_weighting VARCHAR,
> >         k INTEGER
> >     ) RETURNS knn_result
> >
> > @param rel_train Name of the relation containing the training input
> > points.
> >
> > @param rel_test Name of the relation containing the testing input
> > points (these are the points whose k-nearest neighbors are to be found
> > in rel_train).
> >
> > @param ignore_col_train Name of a column to be ignored (e.g. the class
> > column containing the class labels, which can be used for supervised
> > learning elsewhere).
> >
> > @param fn_dist Name of a function with signature <tt>DOUBLE
> > PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that returns
> > the distance between two points. The default is the \ref
> > squared_dist_norm2(float8[],float8[]) "squared Euclidean distance"
> > (following the k-means example: take two points, output their
> > distance). I'd rather use the Minkowski distance and add another
> > parameter, p, with a default of 2, so that the function is more
> > general (p = 1 gives Manhattan, p = 2 gives Euclidean, etc.).
> >
> > @param fn_dist_weighting Name of a function with signature <tt>DOUBLE
> > PRECISION -> DOUBLE PRECISION</tt> that, for each distance, returns a
> > weighted distance value (e.g. none, 1-d, and 1/d weighting schemes).
> >
> > @param k Number of nearest neighbors.
> >
> > @returns A composite value of type knn_result:
> >
> >     indexes   INTEGER[][],
> >     distances DOUBLE PRECISION[][]
> >
> > - <tt>indexes</tt> - an n-by-k matrix of the k-nearest neighbors'
> >   indexes, where n = |rel_test| and k is the input parameter giving
> >   the number of nearest neighbors: row i holds the indexes of the
> >   i-th test point's k-nearest neighbors in columns 0 to k-1.
> > - <tt>distances</tt> - an n-by-k matrix of the corresponding
> >   distances, laid out the same way.
> >
> > In other words, for the i-th data point of rel_test, the row numbers
> > of its k-nearest neighbors in rel_train are in indexes[i][0:k-1], and
> > the distances to those points are in distances[i][0:k-1].
> >
> > For the implementation, I'm not quite sure how to store the results so
> > that I can efficiently update the k-nearest points as a pass over the
> > data is being made. Since kNN is non-iterative, I think it is possible
> > to implement kNN completely in SQL without calling external driver
> > functions. It might also be a good idea to store the results in
> > another table instead of returning a type like this, or to put the
> > results in a temp table and then return the result; feedback is always
> > welcome.
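> > For a concrete (if naive) sketch of such a pure-SQL pass: assuming
> > placeholder tables rel_train and rel_test that each have an integer
> > id column and a float8[] coords column, one query could rank every
> > training point per test point and keep the top k, roughly:
> >
> >     SELECT test_id,
> >            array_agg(train_id ORDER BY dist) AS indexes,
> >            array_agg(dist     ORDER BY dist) AS distances
> >     FROM (
> >         SELECT te.id AS test_id,
> >                tr.id AS train_id,
> >                madlib.squared_dist_norm2(te.coords, tr.coords) AS dist,
> >                row_number() OVER (
> >                    PARTITION BY te.id
> >                    ORDER BY madlib.squared_dist_norm2(te.coords, tr.coords)
> >                ) AS rn
> >         FROM rel_test te CROSS JOIN rel_train tr
> >     ) ranked
> >     WHERE rn <= 5   -- k = 5 for illustration
> >     GROUP BY test_id;
> >
> > This is only meant to illustrate the shape of the query, not a final
> > implementation; the real version would substitute the user-supplied
> > fn_dist and k, and would probably write into a results table as
> > discussed above.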
> > I'm deliberately not using majority voting, because I do not want to
> > limit this implementation to kNN classification; the goal is a general
> > kNN algorithm. We could then come up with another method that uses
> > this kNN implementation and runs a majority vote for classification or
> > an average for regression.
> >
> > FUNCTION MADLIB_SCHEMA.__init_validate_args (similar to the k-means
> > function __seeding_validate_args: k should be positive, k should be
> > less than the number of points, fn_dist should have the correct
> > signature, and fn_dist_weighting should have the correct signature).
> >
> > I look forward to the community's feedback as I work on this module.
> >
> > Best regards,
> > Babak
> >
> > On Mon, Apr 4, 2016 at 1:28 AM, Xiaocheng Tang <[email protected]> wrote:
> >
> >> Hi Babak,
> >>
> >> Thank you for your interest in k-NN!
> >> https://issues.apache.org/jira/browse/MADLIB-927
> >>
> >> The interface of new modules should be consistent with the existing
> >> ones in MADlib. In this case I would suggest studying k-means first:
> >> https://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html
> >> which in a similar way requires user input on k (the number of
> >> centroids) as well as a distance function: Manhattan, Euclidean, or a
> >> general UDF with a specified signature.
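> >> For instance, a user-defined Minkowski distance matching that
> >> signature could be sketched roughly as follows (purely illustrative,
> >> not an existing MADlib function; p is fixed inside the function body,
> >> since the expected signature takes only the two points):
> >>
> >>     CREATE FUNCTION minkowski_dist(a DOUBLE PRECISION[],
> >>                                    b DOUBLE PRECISION[])
> >>     RETURNS DOUBLE PRECISION AS $$
> >>         -- p = 3 here; p = 1 would give Manhattan, p = 2 Euclidean
> >>         SELECT power(sum(power(abs(a[i] - b[i]), 3)), 1.0/3)
> >>         FROM generate_subscripts(a, 1) AS i
> >>     $$ LANGUAGE sql IMMUTABLE;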
> >> As a first step, may I suggest that you send a proposal of the
> >> function definitions, their parameters and return values, as well as
> >> a description of the functions and what they do.
> >>
> >> Based on that we can discuss the design of the interface, and once it
> >> looks good you can start working on the actual implementation. When
> >> you get to the implementation we can help you with the technical
> >> challenges.
> >>
> >> Finally, check out the previous discussion about kNN in the forum:
> >> https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/201603.mbox/%3CCAKBQfzSQWZzUmhAEZQ3jgHLkZV0pvwucMi1Dqg6YHVEQdEbjcw%40mail.gmail.com%3E
> >>
> >> Feel free to ask questions. I look forward to collaborating with you.
> >>
> >> Best,
> >> Xiaocheng
> >>
> >>> On Mar 28, 2016, at 3:34 PM, Babak Alipour <[email protected]> wrote:
> >>>
> >>> Greetings everyone,
> >>>
> >>> I am Babak Alipour, a student at the University of Florida. I have
> >>> been using MADlib and was hoping to use kNN classification, which is
> >>> unfortunately not available, so I decided to give implementing it a
> >>> shot.
> >>>
> >>> Looking at the issue tracker, I found two JIRAs regarding kNN:
> >>> MADLIB-409 and MADLIB-927. The more recent JIRA mentions that a
> >>> naive implementation, i.e. linearly searching through the data, is
> >>> expected.
> >>>
> >>> I have a few questions regarding details the JIRA doesn't specify.
> >>> Generally, what is the interface of the module? This involves
> >>> questions such as: where is the user expected to provide k, whether
> >>> to use distance weighting, and which distance metric to use
> >>> (Manhattan, Euclidean, Minkowski with some p > 2)?
> >>> Another question: how should the user specify the data points whose
> >>> k-nearest neighbors are desired? Is it some subset of the original
> >>> data table, or points from another data table with the same schema
> >>> as the original data table?
> >>> Also, are the output points to be kept in a separate table?
> >>>
> >>> I'd love to hear some feedback from the community so that I can
> >>> move forward with the implementation.
> >>>
> >>> Thanks in advance for your time.
> >>>
> >>> Best regards,
> >>> *Babak Alipour,*
> >>> *University of Florida*
> >
> >
> > --
> > *Babak Alipour,*
> > *University of Florida*
