Re: Adding KNN to madlib

Kazmi,Auon H Tue, 22 Nov 2016 14:58:20 -0800

Hi NJ,

I have implemented a first version of the interface as you suggested. Right now,
I am only looking at the classification task. I will generalize it to work for
the regression task as well. I have a question regarding the output of the
function. Should it just be the predicted label (or the predicted value in the
case of regression)? Can you give an example of the output?

Regards,

Auon Haidar

________________________________
From: Kazmi,Auon H <[email protected]>
Sent: Friday, November 18, 2016 3:16:00 AM
To: [email protected]
Subject: Re: Adding KNN to madlib

Hi NJ,

Thanks for your inputs!

I will go through every one of them and try to incorporate them.

Regards,

Auon Haidar

________________________________
From: Nandish Jayaram <[email protected]>
Sent: Wednesday, November 16, 2016 2:29:05 PM
To: [email protected]
Subject: Re: Adding KNN to madlib

Hi Auon,

Defining the interface is a good start for k-NN. I have slightly modified
your interface to help it conform with other MADlib algorithms' interfaces.
Note that the output for each new data point is not the 'k' nearest
neighbors, but either a classification or regression task on the data point
based on its 'k' nearest neighbors. Every data point in the training data
will have an associated class label (regression value) in a different
column. Normally, the column containing the data point itself is called the
independent variable, and the column containing the class label is called
the dependent variable. If it is classification, you take a majority vote
of the class labels of the 'k' nearest neighbors, and if it is regression,
you average the dependent variable values of the 'k' nearest neighbors.
Here is a preliminary interface we could start with:

knn(
    source_table,        -- TEXT, name of the table containing the training data.
    new_data_table,      -- TEXT, name of the table containing the new data on
                         -- which classification or regression has to be
                         -- performed. Which of the two is performed depends on
                         -- the type of "dependent_varname".
    output_table,        -- TEXT, name of the table where the output predictions
                         -- are written. If this table already exists, an error
                         -- is returned.
    dependent_varname,   -- TEXT, name of the dependent variable column. If this
                         -- column is of type boolean/integer, we could probably
                         -- perform k-NN classification, and perform k-NN
                         -- regression if it is of type double.
    independent_varname, -- TEXT, column defining the data points. Data points
                         -- can be of type SVEC or any type convertible to SVEC,
                         -- such as float[] or integer[].
    k,                   -- INTEGER, (optional, default value could be some odd
                         -- number, say 5) number of neighbors to consider.
    metric               -- TEXT, (optional, default value could be what you are
                         -- using now for distance) the distance metric to use.
);

For now you can just use the distance metric you had mentioned in an
earlier email. Note that the source_table and new_data_table are tables in
the database and not files.
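
The voting and averaging rule described above can be sketched in a few lines of
Python. This is only an illustration, not MADlib code: the function names
knn_predict and squared_dist, the in-memory lists standing in for source_table,
and squared Euclidean distance as the default metric are all assumptions made
for the example.

```python
from collections import Counter
import heapq


def squared_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_points, train_labels, new_point, k=5, metric=squared_dist):
    """Predict for new_point from its k nearest training points.

    Hypothetical helper: majority vote when the labels are bool/int,
    mean when they are floats, mirroring the dependent_varname type rule
    in the proposed interface.
    """
    # Keep the k (distance, label) pairs with the smallest distance.
    nearest = heapq.nsmallest(
        k,
        ((metric(p, new_point), label)
         for p, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    labels = [label for _, label in nearest]
    if all(isinstance(label, (bool, int)) for label in labels):
        # Classification: most common label among the neighbours.
        return Counter(labels).most_common(1)[0][0]
    # Regression: average of the neighbours' dependent values.
    return sum(labels) / len(labels)


# Toy training data: five 2-D points with a class label and a numeric value.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]]
classes = [0, 0, 0, 1, 1]
values = [1.0, 2.0, 3.0, 10.0, 12.0]

label = knn_predict(points, classes, [0.5, 0.5], k=3)  # classification
value = knn_predict(points, values, [0.5, 0.5], k=3)   # regression
```

With an odd k and two classes the vote cannot tie, which is one reason an odd
default such as 5 is a reasonable choice.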

Some pointers to help you start off with the implementation:
-
https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
is a very useful resource with a great hello-world example. It gives you
details about how to add a new module (k-NN would be a new module) to
MADlib.
- k-NN is a great candidate for parallelizing. Do try to use UDA (User
Defined Aggregates) in your implementation. This will require you to add a
C++ layer too, along with the SQL and python layers. Feel free to ask
specific questions about this after you have tried out the hello world
example.
- Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you more
information regarding the C++ abstraction layer in MADlib.
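
To make the parallelization point concrete, here is a rough Python sketch of how
a nearest-neighbour search decomposes into the three pieces a user-defined
aggregate typically has: a transition step applied row by row, a merge step that
combines partial states from different segments, and a final step. The function
names transition, merge and final, and the list-of-pairs state, are assumptions
for illustration only; an actual MADlib UDA would be written in SQL and C++.

```python
import heapq
from functools import reduce


def transition(state, row, query, k):
    """Fold one training row into a segment-local state of at most k pairs."""
    point, label = row
    dist = sum((x - y) ** 2 for x, y in zip(point, query))
    return heapq.nsmallest(k, state + [(dist, label)], key=lambda p: p[0])


def merge(state_a, state_b, k):
    """Combine the partial states computed independently on two segments."""
    return heapq.nsmallest(k, state_a + state_b, key=lambda p: p[0])


def final(state):
    """Return the labels of the nearest rows, closest first."""
    return [label for _, label in state]


# Toy data split across two "segments" to mimic a distributed table.
rows = [([0.0, 0.0], "a"), ([3.0, 0.0], "b"), ([1.0, 0.0], "c"), ([2.0, 0.0], "d")]
query, k = [0.0, 0.0], 2
segment_1, segment_2 = rows[:2], rows[2:]

state_1 = reduce(lambda s, r: transition(s, r, query, k), segment_1, [])
state_2 = reduce(lambda s, r: transition(s, r, query, k), segment_2, [])
neighbours = final(merge(state_1, state_2, k))

# Same computation on a single segment, for comparison.
single = final(reduce(lambda s, r: transition(s, r, query, k), rows, []))
```

Because each segment only ever keeps k candidates, the state passed between
segments stays small no matter how large the training table is.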

Feel free to shout out for help if you are stuck! Cheers. :)

NJ

On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <[email protected]> wrote:

> Hi Frank and NJ,
>
> Thanks for your comments. I will go through the suggestions provided by NJ.
>
> Current interface of KNN is as follows:
>
> 1) Input:
>
>        - Name of table having all the data points in n-dimensional vector
> form (Double Precision[ ])
>
>        - Column-name of these data points
>
>        - Name of file having that n-dim vector (v, say) whose k-nearest
> neighbours need to be found from first table (Double Precision[ ])
>
>        - Column name having this vector
>
>        - value of 'k'
>
>
> It returns 'k' nearest neighbours of vector v from first table having data
> points.
>
>
>
> For now, I am using madlib's squared norm function to calculate distance
> between any two vectors. I will try to generalise that.
>
>
> Please suggest any other improvements.
>
>
>
> Thanks,
>
> Auon Haidar
>
> ________________________________
> From: Frank McQuillan <[email protected]>
> Sent: Tuesday, November 15, 2016 1:30:53 PM
> To: [email protected]
> Subject: Re: Adding KNN to madlib
>
> Auon,
>
> Thanks for working on kNN for MADlib.   Can you expand a little bit on your
> note, and post the interface that you are thinking about and description of
> the arguments?  Then people can comment on that.
>
> Thanks,
> Frank
>
> On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <[email protected]>
> wrote:
>
> > Hi Auon,
> >
> > Great going with your first version of k-NN implementation.
> > Some useful links for coding guidelines are at (see Developer
> > Documentation):
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > MADlib has something called install-checks for basic testing. You can
> > look at any existing module for an example of the same. For instance,
> > check out the install check code for k-means at:
> > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> >
> > I am sure others will pitch in to help you more with your other
> > questions, but these are some starters you can consider! Good luck!
> >
> > NJ
> >
> > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I am a first year Computer Science graduate student at University of
> > > Florida working on implementing KNN in Madlib. I am ready with a first
> > > version of it but I don't know how to proceed with testing and adding
> > > it to Madlib platform. Also, I am not clear on what standards do I have
> > > to choose in the final implementation. My current version asks for the
> > > table name and column name having vectors in which I have to find the
> > > neighbours. The other table given as input holds the vector whose K-NN
> > > needs to be found. It is assuming euclidean distance metric for distance
> > > calculation. It would really help if somebody can share ideas on what
> > > can be added to this functionality.
> > >
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon Haidar Kazmi
> > >
> >
>
