Hi Auon,

That's great! I think the best way to share your code with the community is by opening a pull request on GitHub. Please do that and a lot of folks will be able to comment and give you suggestions.
NJ

On Sat, Dec 3, 2016 at 2:13 PM, Kazmi,Auon H <[email protected]> wrote:

> Hi NJ,
>
> I got the solution to my problem.
>
> So, I might be done with my first version of interface of KNN for
> classification as suggested by you, by Monday or so. I will generalise it
> for regression and then please let me know how to share it with you guys.
> After that, I can start making required changes as and when needed.
>
> regards,
>
> Auon Haidar
>
> ________________________________
> From: Kazmi,Auon H <[email protected]>
> Sent: Thursday, December 1, 2016 2:59:21 PM
> To: [email protected]
> Subject: Re: Adding KNN to madlib
>
> Hi NJ,
>
> No, this is just an example I gave. So, I want in a postgres function to
> iterate over the rows of a table given as a VARCHAR argument.
>
> FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
>
> will do that. Now, r is a record, i.e. a row of table 'point_source'. I
> want to store a particular column of that row r in a variable. Now, this
> column name is also passed as VARCHAR argument to function. I am not able
> to figure out the way to access this particular column from the current row
> 'r'.
>
> Basically, I am trying to iterate over my testing data one by one and pass
> its vector column to a function that finds its label.
>
> Regards,
>
> Auon
>
> ________________________________
> From: Nandish Jayaram <[email protected]>
> Sent: Thursday, December 1, 2016 2:51:47 PM
> To: [email protected]
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> My apologies for the late reply.
> Can you please give me more information regarding the design approach you
> have taken. Information like what files you have created so far would be
> helpful. I am not sure I understand your approach correctly yet. Is the
> above snippet of code the only code you have, or do you have some other
> files too?
>
> NJ
>
> On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <[email protected]> wrote:
>
> > Hi NJ,
> >
> > I got stuck at a place. Need a little help.
> >
> > Suppose I have a function that receives table_name and column_name as
> > varchar.
> >
> > Now I would like to iterate through each row of this table, while
> > accessing the value of this column. I am doing something like this:
> >
> > CREATE OR REPLACE FUNCTION Foo(
> >     table_name VARCHAR,
> >     column_name VARCHAR
> > ) RETURNS VOID AS
> > $BODY$
> > DECLARE
> >     r record;
> >     b integer;
> > BEGIN
> >
> >     FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
> >     LOOP
> >
> >         b := r.column_name;
> >
> >     END LOOP
> > END
> >
> > So, everything works except column_name is a varchar. So, r.column_name
> > won't give me the corresponding column's value in extracted row r. So,
> > suppose it is 'pid' in the given table, then b := r.pid will give the
> > right result, but I want to get this effective statement from
> > b := r.column_name;
> >
> > Could you please help.
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Kazmi,Auon H <[email protected]>
> > Sent: Friday, November 25, 2016 3:23:46 PM
> > To: [email protected]
> > Subject: Re: Adding KNN to madlib
> >
> > Thanks NJ,
> >
> > I will move forward in the suggested way.
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Nandish Jayaram <[email protected]>
> > Sent: Wednesday, November 23, 2016 12:20:35 PM
> > To: [email protected]
> > Subject: Re: Adding KNN to madlib
> >
> > Hey Auon,
> >
> > Starting with only classification for now sounds like a good idea!
> > Yes, the output should be just the predicted label for each row.
> >
> > If the table you want to run the classification task on is like the
> > following:
> >
> > id | x  | y
> > 1  | 10 | 10.5
> > 2  | 30 | 31.5
> > 3  | 20 | 22.5
> >
> > then the output table could be something like the following:
> >
> > id | x  | y    | predicted_label
> > 1  | 10 | 10.5 | true
> > 2  | 30 | 31.5 | false
> > 3  | 20 | 22.5 | true
> >
> > You are basically adding a new column to the input table called
> > "predicted_label", and assigning the label for each row based on the k-NN.
> >
> > We can certainly make it better, by modifying the kNN function interface.
> > But let's just keep it simple for now and work on that later.
> >
> > NJ
> >
> > On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <[email protected]> wrote:
> >
> > > Hi NJ,
> > >
> > > I have implemented a first version of interface as suggested by you.
> > > Right now, I am just looking at classification task. I will generalize
> > > it to work for regression task as well. I have a question regarding
> > > output of the function. Should it just be the predicted label (or
> > > prediction value in case of regression)? Can you give an example of
> > > output?
> > >
> > > Regards,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Kazmi,Auon H <[email protected]>
> > > Sent: Friday, November 18, 2016 3:16:00 AM
> > > To: [email protected]
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi NJ,
> > >
> > > Thanks for your inputs!
> > >
> > > I will go through every one of them and try to incorporate them.
> > >
> > > Regards,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Nandish Jayaram <[email protected]>
> > > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > > To: [email protected]
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi Auon,
> > >
> > > Defining the interface is a good start for k-NN. I have slightly
> > > modified your interface to help it conform with other MADlib
> > > algorithms' interfaces.
> > > Note that the output for each new data point is not the 'k' nearest
> > > neighbors, but either a classification or regression task on the data
> > > point based on its 'k' nearest neighbors. Every data point in the
> > > training data will have an associated class label (regression value) in
> > > a different column. Normally, the column containing the data point
> > > itself is called the independent variable, and the column containing
> > > the class label is called the dependent variable. If it is
> > > classification, you take a majority vote of the class labels of the 'k'
> > > nearest neighbors, and if it is regression, you average the dependent
> > > variable values of the 'k' nearest neighbors.
> > > Here is a preliminary interface we could start with:
> > >
> > > knn(
> > > source_table, -- TEXT, name of table containing training data.
> > > new_data_table, -- TEXT, name of table containing new data on which
> > >     classification or regression has to be performed. Classification or
> > >     regression can be performed based on the type of "dependent_varname".
> > > output_table, -- TEXT, name of the table where output predictors are
> > >     written. If this table is already present, an error is returned.
> > > dependent_varname, -- TEXT, name of the dependent variable column. If
> > >     this column is of type boolean/integer, we could probably perform k-NN
> > >     classification, and perform k-NN regression if this is of type double.
> > > independent_varname, -- TEXT, column defining data points. Data points
> > >     can be of type SVEC or any type convertible to SVEC such as float[] or
> > >     integer[].
> > > k, -- INTEGER, (optional, default value could be some odd number, say 5)
> > >     number of neighbors to consider
> > > metric, -- TEXT, (optional, default value could be what you are using
> > >     now for distance) the distance metric to use.
> > > );
> > >
> > > For now you can just use the distance metric you had mentioned in an
> > > earlier email. Note that the source_table and new_data_table are tables
> > > in the database and not files.
> > >
> > > Some pointers to help you start off with the implementation:
> > > - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> > >   is a very useful resource with a great hello-world example. It gives
> > >   you details about how to add a new module (k-NN would be a new module)
> > >   to MADlib.
> > > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> > >   Defined Aggregates) in your implementation. This will require you to
> > >   add a C++ layer too, along with the SQL and python layers. Feel free to
> > >   ask specific questions about this after you have tried out the hello
> > >   world example.
> > > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you
> > >   more information regarding the C++ abstraction layer in MADlib.
> > >
> > > Feel free to shout out for help if you are stuck! Cheers. :)
> > >
> > > NJ
> > >
> > > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <[email protected]> wrote:
> > >
> > > > Hi Frank and NJ,
> > > >
> > > > Thanks for your comments.
> > > > I will go through the suggestions provided by NJ.
> > > >
> > > > Current interface of KNN is as follows:
> > > >
> > > > 1) Input:
> > > >
> > > >    - Name of table having all the data points in n-dimensional vector
> > > >      form (Double Precision[ ])
> > > >
> > > >    - Column-name of these data points
> > > >
> > > >    - Name of file having that n-dim vector (v, say) whose k-nearest
> > > >      neighbours need to be found from first table (Double Precision[ ])
> > > >
> > > >    - Column name having this vector
> > > >
> > > >    - value of 'k'
> > > >
> > > > It returns 'k' nearest neighbours of vector v from first table having
> > > > data points.
> > > >
> > > > For now, I am using madlib's squared norm function to calculate
> > > > distance between any two vectors. I will try to generalise that.
> > > >
> > > > Please suggest any other improvements.
> > > >
> > > > Thanks,
> > > >
> > > > Auon Haidar
> > > >
> > > > ________________________________
> > > > From: Frank McQuillan <[email protected]>
> > > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > > To: [email protected]
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Auon,
> > > >
> > > > Thanks for working on kNN for MADlib. Can you expand a little bit on
> > > > your note, and post the interface that you are thinking about and
> > > > description of the arguments? Then people can comment on that.
> > > >
> > > > Thanks,
> > > > Frank
> > > >
> > > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Auon,
> > > > >
> > > > > Great going with your first version of k-NN implementation.
> > > > > Some useful links for coding guidelines are at (see Developer
> > > > > Documentation):
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > > MADlib has something called install-checks for basic testing.
> > > > > You can look at any existing module for an example of the same. For
> > > > > instance, check out the install check code for k-means at:
> > > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > > >
> > > > > I am sure others will pitch in to help you more with your other
> > > > > questions, but these are some starters you can consider! Good luck!
> > > > >
> > > > > NJ
> > > > >
> > > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am a first year Computer Science graduate student at University
> > > > > > of Florida working on implementing KNN in Madlib. I am ready with
> > > > > > a first version of it but I don't know how to proceed with testing
> > > > > > and adding it to Madlib platform. Also, I am not clear on what
> > > > > > standards do I have to choose in the final implementation. My
> > > > > > current version asks for the table name and column name having
> > > > > > vectors in which I have to find the neighbours. The other table
> > > > > > given as input holds the vector whose K-NN needs to be found. It
> > > > > > is assuming Euclidean distance metric for distance calculation. It
> > > > > > would really help if somebody can share ideas on what can be added
> > > > > > to this functionality.
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Auon Haidar Kazmi
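
The prediction rule NJ describes in this thread (a majority vote over the class labels of the 'k' nearest neighbors for classification, an average of their dependent variable values for regression) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MADlib code: the names `squared_distance` and `knn_predict`, the `task` argument, and the in-memory lists standing in for the training table are all hypothetical, and the distance is squared Euclidean, as in the first version discussed above.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_points, train_labels, query, k=5, task="classification"):
    """Predict a label or value for `query` from its k nearest training points.

    Hypothetical helper: train_points plays the role of the independent
    variable column, train_labels the dependent variable column.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Rank training rows by distance to the query and keep the k closest.
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda row: squared_distance(row[0], query),
    )
    neighbour_labels = [label for _, label in ranked[:k]]
    if task == "classification":
        # Majority vote; a tie falls to whichever label most_common lists first.
        return Counter(neighbour_labels).most_common(1)[0][0]
    if task == "regression":
        # Average of the dependent variable values of the neighbours.
        return sum(neighbour_labels) / len(neighbour_labels)
    raise ValueError("task must be 'classification' or 'regression'")
```

Sorting every training row for each query is the simplest possible approach and costs O(n log n) per prediction. A database implementation would more likely compute distances in SQL and aggregate per query row, which is where the user-defined aggregates suggested above would come in.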
