Thank you so much, Xiaocheng, for your reply. I was actually using the k-means interface and modifying it to fit kNN's needs.
The function definition as I have been thinking about it is:

FUNCTION MADLIB_SCHEMA.knn(
    rel_train VARCHAR,
    rel_test VARCHAR,
    ignore_col_train VARCHAR,
    fn_dist VARCHAR,
    fn_dist_weighting VARCHAR,
    k INTEGER
) RETURNS knn_result

@param rel_train Name of the relation containing the training input points.

@param rel_test Name of the relation containing the test input points (these are the points whose k-nearest neighbors are to be found in rel_train).

@param ignore_col_train Name of a column to be ignored (e.g. the class column, containing labels that can be used for supervised learning elsewhere).

@param fn_dist Name of a function with signature <tt>DOUBLE PRECISION[] x DOUBLE PRECISION[] -> DOUBLE PRECISION</tt> that returns the distance between two points. The default is the \ref squared_dist_norm2(float8[],float8[]) "squared Euclidean distance". (Based on the k-means example: take two points, output their distance.) I would rather use the Minkowski distance and add another parameter, p, with a default of 2, so that the function is more general (p=1 gives Manhattan, p=2 gives Euclidean, etc.).

@param fn_dist_weighting Name of a function with signature <tt>DOUBLE PRECISION -> DOUBLE PRECISION</tt> that, for each distance d, returns a weighted distance value (e.g. no weighting, 1-d, or 1/d weighting schemes).

@param k Number of nearest neighbors.

@returns A composite value of type knn_result (indexes INTEGER[][], distances DOUBLE PRECISION[][]):
- <tt>indexes</tt> - matrix[n][k] of the k-nearest neighbors' indexes, where n is the size of rel_test (i.e. |rel_test|) and k is the input parameter specifying the number of nearest neighbors. For the test point with index i, the indexes of its k-nearest neighbors are in columns 0 to k-1 of row i.
- <tt>distances</tt> - matrix[n][k] of the k-nearest neighbors' distances, laid out the same way.
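To make the intended semantics concrete, here is a minimal Python sketch of the behavior the interface describes: brute-force search with a Minkowski distance (parameter p) and an optional distance-weighting function. This is illustrative only, not the SQL implementation; the names mirror the proposed parameters.

```python
def minkowski(a, b, p=2):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def knn(train, test, k, fn_dist=minkowski, fn_dist_weighting=None):
    """For each test point, return the row indexes (into `train`) and
    distances of its k nearest training points, mirroring the proposed
    knn_result type (indexes INTEGER[][], distances DOUBLE PRECISION[][])."""
    indexes, distances = [], []
    for q in test:
        # Score every training point, then keep the k smallest distances
        # (ties broken by the lower training-row index).
        scored = sorted((fn_dist(q, t), i) for i, t in enumerate(train))[:k]
        indexes.append([i for _, i in scored])
        row = [d for d, _ in scored]
        if fn_dist_weighting is not None:
            row = [fn_dist_weighting(d) for d in row]  # e.g. 1/d or 1-d
        distances.append(row)
    return indexes, distances
```

This sorts all |rel_train| candidates per test point, which matches the naive linear scan the JIRA expects; a SQL version could instead ORDER BY distance and LIMIT k per test point.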
In other words: the row numbers of the k-nearest neighbors in rel_train, for the i-th data point of rel_test, are in indexes[i][0:k-1], and the distances to those data points are in distances[i][0:k-1].

For the implementation, I'm not quite sure how to store the results so that I can efficiently update the k-nearest points while a pass over the data is being made. Since kNN is non-iterative, I think it is possible to implement it completely in SQL, without calling external driver functions. It might also be a good idea to store the results in another table instead of returning a type like this, or to put them in a temp table and then return the result; feedback is always welcome.

I'm deliberately not using majority voting, because I do not want to limit this implementation to kNN classification. The goal is a general kNN primitive; we could then add another method that uses this implementation and runs a majority vote for classification, or takes an average for regression.

FUNCTION MADLIB_SCHEMA.__init_validate_args
(Similar to the k-means function __seeding_validate_args: k should be positive, k should be less than the number of points, fn_dist should have the correct signature, and fn_dist_weighting should have the correct signature.)

I look forward to the community's feedback as I work on this module.

Best regards,
Babak

On Mon, Apr 4, 2016 at 1:28 AM, Xiaocheng Tang <xt...@pivotal.io> wrote:

> Hi Babak,
>
> Thank you for your interest in k-NN!
> https://issues.apache.org/jira/browse/MADLIB-927
>
> The interface of new modules should be consistent with
> the existing ones in MADlib.
> In this case I would suggest
> studying the k-means module first:
> https://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html
> which in a similar way requires user input
> on k (the number of centroids) as well as a distance function,
> including Manhattan, Euclidean, or a general UDF with a specified signature.
>
> As a first step, may I suggest that you send a proposal of the function
> definitions: the parameters and return values, as well as a description of
> the functions and what they do.
>
> Based on that we can discuss the design of the interface, and once it looks
> good you can start working on the actual implementation.
> When you get to the implementation we can help you with technical challenges.
>
> Finally, check out the previous discussions about kNN in the forum:
> https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/201603.mbox/%3CCAKBQfzSQWZzUmhAEZQ3jgHLkZV0pvwucMi1Dqg6YHVEQdEbjcw%40mail.gmail.com%3E
>
> Feel free to ask questions. I look forward to collaborating with you.
>
> Best,
> Xiaocheng
>
> On Mar 28, 2016, at 3:34 PM, Babak Alipour <babak.alip...@gmail.com> wrote:
>
> > Greetings everyone,
> >
> > I am Babak Alipour, a student at the University of Florida. I have been
> > using MADlib and was hoping to use kNN classification, which is
> > unfortunately not available, so I decided to give implementing it a shot.
> >
> > Looking at the issue tracker, I found two JIRAs regarding kNN: MADLIB-409
> > and MADLIB-927.
> > The more recent JIRA mentions that a naive implementation, i.e. linearly
> > searching through the data, is expected.
> > I have a few questions regarding details the JIRA doesn't specify:
> > Generally, what is the interface of the module?
> > This question involves
> > questions such as: where is the user expected to provide k, whether to use
> > distance weighting, and the distance metric (Manhattan, Euclidean, Minkowski
> > with some p > 2)?
> > Another question is: how should the user specify the data points whose
> > k-nearest neighbors are desired? Is it some subset of the original data
> > table, or points from another data table with the same schema as the
> > original data table?
> > Also, are the output points to be kept in a separate table?
> >
> > I'd love to hear some feedback from the community so that I can move
> > forward with the implementation.
> >
> > Thanks in advance for your time.
> >
> > Best regards,
> > *Babak Alipour,*
> > *University of Florida*

-- 
*Babak Alipour,*
*University of Florida*