Thanks Babak. We will review the interface and get back to you soon. In the meantime, you may go ahead with the implementation; I would suggest that you describe your approach in a design doc and share it here before actually coding it up. The design doc should include the necessary math formulations and relevant references, if there are any.
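[Editor's note: as an illustration of the kind of math formulation such a design doc might include, here is a minimal Python sketch of the generalized Minkowski distance and the distance-weighting schemes proposed in the quoted thread below. Function names are hypothetical, not MADlib APIs.]

```python
def minkowski_dist(x, y, p=2.0):
    """Minkowski distance between two points:
    p=1 gives Manhattan, p=2 gives Euclidean (the proposed default)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# The three weighting schemes mentioned in the thread (none, 1-d, 1/d):
def weight_none(d):
    return d

def weight_one_minus(d):
    return 1.0 - d

def weight_inverse(d):
    # Guard against division by zero for coincident points.
    return 1.0 / d if d > 0 else float("inf")
```

With this parameterization, a single `p` argument (default 2) covers the Manhattan and Euclidean cases without separate distance functions.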
> On Apr 8, 2016, at 8:19 AM, Babak Alipour <[email protected]> wrote:
>
> Thank you so much Xiaocheng for your reply.
>
> I was actually using the k-means interface and modifying it to fit kNN
> needs.
>
> The function definition as I have been thinking about it is:
>
> FUNCTION MADLIB_SCHEMA.knn(
>     rel_train VARCHAR,
>     rel_test VARCHAR,
>     ignore_col_train VARCHAR,
>     fn_dist VARCHAR,
>     fn_dist_weighting VARCHAR,
>     k INTEGER
> ) RETURNS knn_result
>
> @param rel_train Name of the relation containing the training input points
> @param rel_test Name of the relation containing the testing input points
>     (these are the points whose k-nearest neighbors are to be found in
>     rel_train)
> @param ignore_col_train Name of a column to be ignored (e.g. the class
>     column, which can be used for supervised learning elsewhere)
> @param fn_dist Name of a function with signature <tt>DOUBLE PRECISION[] x
>     DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that returns the distance
>     between two points. The default is the \ref
>     squared_dist_norm2(float8[],float8[]) "squared Euclidean distance".
>     (Based on the k-means example: take two points, output their distance.)
>     I'd rather use the Minkowski distance and add another parameter, p,
>     with a default of 2, so that the function is more general (p=1 gives
>     Manhattan, p=2 gives Euclidean, etc.)
> @param fn_dist_weighting Name of a function with signature <tt>DOUBLE
>     PRECISION -> DOUBLE PRECISION</tt> that, for each distance, returns a
>     weighted distance value (e.g. none, 1-d, and 1/d weighting schemes)
> @param k Number of nearest neighbors
>
> @returns a composite value of type knn_result:
>     indexes INTEGER[][],
>     distances DOUBLE PRECISION[][]
> - <tt>indexes</tt> - matrix[n][k] of the k nearest neighbors' indexes
>   (n is the size of rel_test, i.e. |rel_test|; k is the input parameter
>   specifying the number of nearest neighbors). Row i holds the indexes of
>   the k nearest neighbors of test point i, one per column 0 to k-1.
> - <tt>distances</tt> - matrix[n][k] of the k nearest neighbors' distances
> In other words, the row numbers in rel_train of the k nearest neighbors of
> the i-th data point of rel_test are in indexes[i][0:k-1], and the distances
> to those data points are in distances[i][0:k-1].
>
> For the implementation, I'm not quite sure how to store the results so
> that I can efficiently update the k nearest points as a pass over the data
> is made. Since kNN is non-iterative, I think it is possible to implement
> it entirely in SQL without calling external driver functions. It might
> also be a good idea to store the results in another table instead of
> returning a type like this, or to put the results in a temp table and then
> return them; feedback is always welcome.
> I'm deliberately not using majority voting, because I do not want to limit
> this implementation to kNN classification. The idea is a general kNN
> algorithm on top of which we could later build another method that runs a
> majority vote for classification, or takes an average for regression.
>
> FUNCTION MADLIB_SCHEMA.__init_validate_args
>     (Similar to the k-means function __seeding_validate_args: k should be
>     positive, k should be less than the number of points, fn_dist should
>     have the correct signature, and fn_dist_weighting should have the
>     correct signature.)
>
> I look forward to the community's feedback as I work on this module.
>
> Best regards,
> Babak
>
> On Mon, Apr 4, 2016 at 1:28 AM, Xiaocheng Tang <[email protected]> wrote:
>
>> Hi Babak,
>>
>> Thank you for your interest in k-NN!
>> https://issues.apache.org/jira/browse/MADLIB-927
>>
>> The interface of new modules should be consistent with
>> the existing ones in MADlib. In this case I would suggest
>> studying K-means first:
>> https://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html
>> which in a similar way requires user input on k (the number of centroids)
>> as well as a distance function, including Manhattan, Euclidean, or a
>> general UDF with a specified signature.
>>
>> As a first step, may I suggest that you send a proposal of the function
>> definitions: the parameters and return values, as well as a description
>> of the functions and what they do.
>>
>> Based on that we can discuss the design of the interface, and once it
>> looks good you can start working on the actual implementation. When you
>> get to the implementation we can help you with technical challenges.
>>
>> Finally, check out the previous discussion about kNN in the forum:
>> https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/201603.mbox/%3CCAKBQfzSQWZzUmhAEZQ3jgHLkZV0pvwucMi1Dqg6YHVEQdEbjcw%40mail.gmail.com%3E
>>
>> Feel free to ask questions. Looking forward to collaborating with you.
>>
>> Best,
>> Xiaocheng
>>
>>> On Mar 28, 2016, at 3:34 PM, Babak Alipour <[email protected]> wrote:
>>>
>>> Greetings everyone,
>>>
>>> I am Babak Alipour, a student at the University of Florida. I have been
>>> using MADlib and was hoping to use kNN classification, which is
>>> unfortunately not available, so I decided to give implementing it a shot.
>>>
>>> Looking at the issue tracker, I found two JIRAs regarding kNN:
>>> MADLIB-409 and MADLIB-927.
>>> The more recent JIRA mentions that a naive implementation, i.e. linearly
>>> searching through the data, is expected.
>>> I have a few questions regarding details the JIRA doesn't specify.
>>> What is the interface of the module? For example: where is the user
>>> expected to provide k, whether to use distance weighting, and the
>>> distance metric (Manhattan, Euclidean, Minkowski with some p > 2)?
>>> Another question: how should the user specify the data points whose
>>> k-nearest neighbors are desired? Is it some subset of the original data
>>> table, or points from another table with the same schema as the original
>>> data table?
>>> Also, are the output points to be kept in a separate table?
>>>
>>> I'd love to hear some feedback from the community so that I can move
>>> forward with the implementation.
>>>
>>> Thanks in advance for your time.
>>>
>>> Best regards,
>>> Babak Alipour,
>>> University of Florida
>
> --
> Babak Alipour,
> University of Florida
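[Editor's note: a minimal in-memory sketch of the semantics proposed in the thread above, in hypothetical Python rather than the eventual SQL implementation. It does the naive linear scan the JIRA expects, produces the indexes[n][k] and distances[n][k] matrices Babak describes, and layers majority-vote classification on top of the general kNN routine, as he suggests. All names here are illustrative, not MADlib APIs.]

```python
from collections import Counter

def minkowski(x, y, p=2.0):
    """Minkowski distance; p=2 is the Euclidean default."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def knn(rel_train, rel_test, k, fn_dist=minkowski):
    """Brute-force kNN: for each test point, linearly scan the training
    points and keep the k closest. Returns (indexes, distances), each an
    n-by-k matrix where n = len(rel_test)."""
    indexes, distances = [], []
    for q in rel_test:
        scored = sorted(
            ((fn_dist(q, t), i) for i, t in enumerate(rel_train)),
            key=lambda pair: pair[0],
        )[:k]
        distances.append([d for d, _ in scored])
        indexes.append([i for _, i in scored])
    return indexes, distances

def knn_classify(rel_train, labels, rel_test, k):
    """Majority vote over the k nearest neighbors' labels: the separate
    classification method proposed on top of the general kNN routine."""
    idx, _ = knn(rel_train, rel_test, k)
    return [Counter(labels[i] for i in row).most_common(1)[0][0]
            for row in idx]
```

An average of the neighbors' labels in place of the `Counter` vote would give the regression variant mentioned in the thread; the general `knn` routine is unchanged in either case.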
