Thanks NJ,

I will move forward in the suggested way.




Regards,

Auon

________________________________
From: Nandish Jayaram <[email protected]>
Sent: Wednesday, November 23, 2016 12:20:35 PM
To: [email protected]
Subject: Re: Adding KNN to madlib

Hey Auon,

Starting with only classification for now sounds like a good idea!
Yes, the output should be just the predicted label for each row.
If the table you want to run the classification task on is like the
following:
*id |   x   |    y*
 1  |  10   |  10.5
 2  |  30   |  31.5
 3  |  20   |  22.5

then the output table could be something like the following:
*id |   x   |    y    |  predicted_label*
 1  |  10   |  10.5   |  true
 2  |  30   |  31.5   |  false
 3  |  20   |  22.5   |  true

You are basically adding a new column called "predicted_label" to the input
table, and assigning the label for each row based on its k nearest neighbors.
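
To make the example above concrete, here is a minimal Python sketch (not
MADlib code) of how a predicted_label column could be filled in. The function
names, the training rows, and the choice of k = 3 are all hypothetical, made
up for illustration. The distance used is the squared Euclidean distance
mentioned elsewhere in this thread.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def knn_predict_labels(training_rows, new_rows, k=5):
    """Return copies of new_rows with an extra 'predicted_label' key.

    training_rows: list of (point, label) pairs (hypothetical layout).
    new_rows: list of dicts with 'id', 'x' and 'y' keys.
    """
    output = []
    for row in new_rows:
        point = (row["x"], row["y"])
        # Keep the k training points closest to this row.
        nearest = sorted(
            training_rows, key=lambda t: squared_distance(t[0], point)
        )[:k]
        # Classification: majority vote over the neighbours' labels.
        votes = Counter(label for _, label in nearest)
        predicted = votes.most_common(1)[0][0]
        output.append({**row, "predicted_label": predicted})
    return output


# Made-up training data: (point, class label).
training_rows = [
    ((9.0, 10.0), True),
    ((11.0, 11.0), True),
    ((19.0, 21.0), True),
    ((21.0, 23.0), True),
    ((18.0, 24.0), True),
    ((29.0, 30.0), False),
    ((31.0, 32.0), False),
    ((28.0, 33.0), False),
]

# The three rows from the example table above.
new_rows = [
    {"id": 1, "x": 10, "y": 10.5},
    {"id": 2, "x": 30, "y": 31.5},
    {"id": 3, "x": 20, "y": 22.5},
]

result = knn_predict_labels(training_rows, new_rows, k=3)
for row in result:
    print(row["id"], row["x"], row["y"], row["predicted_label"])
```

With these made-up training rows, the three new rows come out as true, false,
true, which matches the pattern in the example table. For regression, the
majority vote would be replaced by an average of the neighbours' values.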

We can certainly make it better, by modifying the kNN function interface.
But let's just keep it simple for now and work on that later.

NJ

On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <[email protected]> wrote:

>
> Hi NJ,
>
> I have implemented a first version of the interface as you suggested. Right
> now, I am looking only at the classification task; I will generalize it to
> work for regression as well. I have a question regarding the output of the
> function. Should it just be the predicted label (or the predicted value in
> the case of regression)? Can you give an example of the output?
>
>
>
>
>
> Regards,
>
> Auon Haidar
>
> ________________________________
> From: Kazmi,Auon H <[email protected]>
> Sent: Friday, November 18, 2016 3:16:00 AM
> To: [email protected]
> Subject: Re: Adding KNN to madlib
>
> Hi NJ,
>
> Thanks for your inputs!
>
> I will go through every one of them and try to incorporate them.
>
>
>
> Regards,
>
> Auon Haidar
>
> ________________________________
> From: Nandish Jayaram <[email protected]>
> Sent: Wednesday, November 16, 2016 2:29:05 PM
> To: [email protected]
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> Defining the interface is a good start for k-NN. I have slightly modified
> your interface to help it conform with other MADlib algorithms' interfaces.
> Note that the output for each new data point is not the 'k' nearest
> neighbors, but either a classification or regression task on the data point
> based on its 'k' nearest neighbors. Every data point in the training data
> will have an associated class label (regression value) in a different
> column. Normally, the column containing the data point itself is called the
> independent variable, and the column containing the class label is called
> the dependent variable. If it is classification, you take a majority vote
> of the class labels of the 'k' nearest neighbors, and if it is regression,
> you average the dependent variable values of the 'k' nearest neighbors.
> Here is a preliminary interface we could start with:
>
> *knn*(
> source_table, -- *TEXT, name of table containing training data.*
> new_data_table, -- *TEXT, name of table containing new data on which
> classification or regression has to be performed. Classification or
> regression can be performed based on the type of "dependent_varname".*
> output_table, -- *TEXT, name of the table where output predictors are
> written. If this table is already present, an error is returned.*
> dependent_varname, -- *TEXT, name of the dependent variable column. If
> this column is of type boolean/integer, we could probably perform k-NN
> classification, and perform k-NN regression if this is of type double.*
> independent_varname, -- *TEXT, column defining data points. Data points can
> be of type SVEC or any type convertible to SVEC such as float[] or
> integer[].*
> k, -- *INTEGER, (optional, default value could be some odd number, say 5)
> number of neighbors to consider.*
> metric, -- *TEXT, (optional, default value could be what you are using now
> for distance) the distance metric to use.*
> );
>
> For now you can just use the distance metric you had mentioned in an
> earlier email. Note that the source_table and new_data_table are tables in
> the database and not files.
>
> Some pointers to help you start off with the implementation:
> - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> is a very useful resource with a great hello-world example. It gives you
> details about how to add a new module (k-NN would be a new module) to
> MADlib.
> - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> Defined Aggregates) in your implementation. This will require you to add a
> C++ layer too, along with the SQL and python layers. Feel free to ask
> specific questions about this after you have tried out the hello world
> example.
> - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you more
> information regarding the C++ abstraction layer in MADlib.
>
> Feel free to shout out for help if you are stuck! Cheers. :)
>
> NJ
>
> On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <[email protected]> wrote:
>
> > Hi Frank and NJ,
> >
> > Thanks for your comments. I will go through the suggestions provided by NJ.
> >
> > Current interface of KNN is as follows:
> >
> > 1) Input:
> >
> >        - Name of table having all the data points in n-dimensional vector
> >          form (double precision[])
> >
> >        - Column name of these data points
> >
> >        - Name of file having the n-dimensional vector (v, say) whose
> >          k nearest neighbours need to be found in the first table
> >          (double precision[])
> >
> >        - Column name having this vector
> >
> >        - Value of 'k'
> >
> >
> > It returns the 'k' nearest neighbours of vector v from the first table,
> > which holds the data points.
> >
> >
> >
> > For now, I am using madlib's squared norm function to calculate distance
> > between any two vectors. I will try to generalise that.
> >
> >
> > Please suggest any other improvements.
> >
> >
> >
> > Thanks,
> >
> > Auon Haidar
> >
> > ________________________________
> > From: Frank McQuillan <[email protected]>
> > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > To: [email protected]
> > Subject: Re: Adding KNN to madlib
> >
> > Auon,
> >
> > Thanks for working on kNN for MADlib. Can you expand a little bit on your
> > note, and post the interface that you are thinking about and a description
> > of the arguments? Then people can comment on that.
> >
> > Thanks,
> > Frank
> >
> > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <[email protected]>
> > wrote:
> >
> > > Hi Auon,
> > >
> > > Great going with your first version of k-NN implementation.
> > > Some useful links for coding guidelines are at (see Developer
> > > Documentation):
> > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > MADlib has something called install-checks for basic testing. You can
> > > look at any existing module for an example. For instance, check out the
> > > install-check code for k-means at:
> > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > >
> > > I am sure others will pitch in to help you more with your other
> > > questions, but these are some starters you can consider! Good luck!
> > >
> > > NJ
> > >
> > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am a first-year Computer Science graduate student at the University
> > > > of Florida working on implementing KNN in MADlib. I have a first
> > > > version ready, but I don't know how to proceed with testing it and
> > > > adding it to the MADlib platform. Also, I am not clear on which
> > > > standards I have to follow in the final implementation. My current
> > > > version asks for the table name and column name holding the vectors
> > > > among which I have to find the neighbours. The other table given as
> > > > input holds the vector whose K-NN needs to be found. It assumes the
> > > > Euclidean distance metric for distance calculation. It would really
> > > > help if somebody could share ideas on what can be added to this
> > > > functionality.
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > Regards,
> > > >
> > > > Auon Haidar Kazmi
> > > >
> > >
> >
>
