Hi Babak,

Thank you for your interest in k-NN!
https://issues.apache.org/jira/browse/MADLIB-927 

The interface of a new module should be consistent with the existing
ones in MADlib. In this case I would suggest studying the k-means
interface first:
https://madlib.incubator.apache.org/docs/latest/group__grp__kmeans.html
In a similar way, it requires the user to supply k (there, the number of
centroids) as well as a distance function, which can be Manhattan,
Euclidean, or a general UDF with a specified signature.
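
To make the shape of such an interface concrete, here is a rough sketch in
plain Python of a naive (linear-scan) k-NN classifier that takes k and a
pluggable distance function. It is only an illustration under my own
assumptions: the names knn_classify, manhattan and squared_euclidean are
hypothetical and are not MADlib functions, and a real module would be
exposed through SQL rather than Python.

```python
import heapq
from collections import Counter


def manhattan(a, b):
    """L1 distance between two equal-length numeric sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def squared_euclidean(a, b):
    """Squared L2 distance; it ranks neighbors the same way as L2 does."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(train_points, train_labels, query, k, dist=squared_euclidean):
    """Hypothetical helper: label `query` by majority vote of its k nearest
    training points, found by a linear scan over all of them."""
    if len(train_points) != len(train_labels):
        raise ValueError("points and labels must have the same length")
    if not 1 <= k <= len(train_points):
        raise ValueError("k must be between 1 and the number of points")
    # Linear scan: compute the distance to every training point and keep
    # the k smallest.
    neighbors = heapq.nsmallest(
        k,
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),
    )
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    points = [(0, 0), (0, 1), (5, 5), (6, 5)]
    labels = ["a", "a", "b", "b"]
    print(knn_classify(points, labels, (0.2, 0.1), k=3))
    print(knn_classify(points, labels, (5.5, 5.0), k=3, dist=manhattan))
```

Passing the distance as an argument mirrors how k-means accepts either a
built-in metric or a user-defined function; whether k-NN should do the same
is exactly the kind of thing a proposal could settle.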

As a first step, may I suggest that you send a proposal with the function
definitions, their parameters and return values, and a description of
what each function does.

Based on that, we can discuss the design of the interface, and once it
looks good you can start on the actual implementation. When you get to
that stage, we can help you with technical challenges.

Finally, check out the previous discussion about kNN on the dev mailing list:
https://mail-archives.apache.org/mod_mbox/incubator-madlib-dev/201603.mbox/%3CCAKBQfzSQWZzUmhAEZQ3jgHLkZV0pvwucMi1Dqg6YHVEQdEbjcw%40mail.gmail.com%3E

Feel free to ask questions. Look forward to collaborating with you.

Best,
Xiaocheng



> On Mar 28, 2016, at 3:34 PM, Babak Alipour <[email protected]> wrote:
> 
> Greetings everyone,
> 
> I am Babak Alipour, a student at University of Florida. I have been using
> MADlib and was hoping to use kNN classification, which is unfortunately not
> available so I decided to give implementing it a shot.
> 
> Looking at the issue tracker, I found two Jiras regarding kNN: MADLIB-409
> and MADLIB-927.
> The more recent JIRA mentions that a naive implementation, or linearly
> searching through the data, is expected.
> I have a few questions regarding the details the JIRA doesn't specify:
> Generally, what is the interface of the module? This question involves
> questions such as:  Where is the user expected to provide k, whether to use
> distance weighting and distance metric (manhattan, euclidean, minkowski
> with some p > 2)?
> Another question is, how should the user specify the data points whose
> k-nearest neighbors are desired? Is it some subset of the original data
> table or points from another data table with same schema as the original
> data table?
> Also, are the output points to be kept in a separate table?
> 
> I'd love to hear some feedback from the community so that I can move
> forward with the implementation.
> 
> Thanks in advance for your time.
> 
> 
> Best regards,
> *Babak Alipour ,*
> *University of Florida*
