Hi Auon,

That's great!
I think the best way to share your code with the community is by opening a
pull request on GitHub. Once you do that, a lot of folks will be able to
comment and give you suggestions.

NJ

On Sat, Dec 3, 2016 at 2:13 PM, Kazmi,Auon H <[email protected]> wrote:

> Hi NJ,
>
> I got the solution to my problem.
>
> So, I might be done with my first version of interface of KNN for
> classification as suggested by you, by Monday or so. I will generalise it
> for regression and then please let me know how to share it with you guys.
> After that, I can start making required changes as and when needed.
>
>
>
> regards,
>
> Auon Haidar
>
> ________________________________
> From: Kazmi,Auon H <[email protected]>
> Sent: Thursday, December 1, 2016 2:59:21 PM
> To: [email protected]
> Subject: Re: Adding KNN to madlib
>
> Hi NJ,
>
> No, this is just an example I gave. In a postgres function, I want to
> iterate over the rows of a table whose name is given as a VARCHAR argument.
>
> FOR r IN EXECUTE format('SELECT * FROM %I', point_source)
>
> will do that. Now, r is a record, i.e. a row of table 'point_source'. I
> want to store a particular column of that row r in a variable. Now, this
> column name is also passed as VARCHAR argument to function. I am not able
> to figure out the way to access this particular column from the current row
> 'r'.
>
>
> Basically, I am trying to iterate over my testing data one by one and pass
> its vector column to a function that finds its label.
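One common way around this is to put the column name into the dynamic query itself and give it a fixed alias, for example format('SELECT %I AS val FROM %I', column_name, point_source), so that the loop body can always read r.val whatever the real column is called. The thread does not say which fix was eventually used, so treat this as one possibility only. The Python sketch below shows just the string-building step; quote_ident and build_column_query are hypothetical helpers written for this example, and quote_ident only roughly mirrors what %I does (it always adds quotes).

```python
def quote_ident(name):
    """Double-quote an SQL identifier, doubling any embedded quotes.

    Hypothetical helper for illustration; inside PL/pgSQL,
    format('%I', ...) handles identifier quoting.
    """
    return '"' + name.replace('"', '""') + '"'


def build_column_query(table_name, column_name):
    """Return a query exposing the requested column under the fixed alias `val`."""
    return "SELECT {} AS val FROM {}".format(
        quote_ident(column_name), quote_ident(table_name)
    )


print(build_column_query("point_source", "pid"))
# SELECT "pid" AS val FROM "point_source"
```

Because the alias is fixed, the code that consumes each row no longer needs to know the column name at all.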
>
>
>
> Regards,
>
> Auon
>
>
> ________________________________
> From: Nandish Jayaram <[email protected]>
> Sent: Thursday, December 1, 2016 2:51:47 PM
> To: [email protected]
> Subject: Re: Adding KNN to madlib
>
> Hi Auon,
>
> My apologies for the late reply.
> Can you please give me more information regarding the design approach you
> have taken? Information like what files you have created so far would be
> helpful. I am not sure I understand your approach correctly yet. Is the
> above snippet of code the only code you have, or do you have some other
> files too?
>
> NJ
>
> On Tue, Nov 29, 2016 at 10:06 PM, Kazmi,Auon H <[email protected]> wrote:
>
> > Hi NJ,
> >
> > I got stuck at a place. Need a little help.
> >
> > Suppose I have a function that receives table_name and column_name as
> > varchar.
> >
> > Now I would like to iterate through each row of this table, while
> > accessing the value of this column. I am doing something like this:
> >
> >
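> > CREATE OR REPLACE FUNCTION Foo(
> >     table_name VARCHAR,
> >     column_name VARCHAR
> > ) RETURNS VOID AS
> > $BODY$
> > DECLARE
> >     r record;
> >     b integer;
> > BEGIN
> >     -- Iterate over every row of the table named by table_name.
> >     FOR r IN EXECUTE format('SELECT * FROM %I', table_name)
> >     LOOP
> >         -- Problem line: this looks for a field literally named
> >         -- "column_name", not the column that the argument names.
> >         b := r.column_name;
> >     END LOOP;
> > END;
> > $BODY$ LANGUAGE plpgsql;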
> >
> > So, everything works except that column_name is a varchar, so r.column_name
> > won't give me the corresponding column's value in the extracted row r.
> > Suppose the column is 'pid' in the given table: then b := r.pid will give
> > the right result, but I want to get this effective statement from
> > b := r.column_name;
> >
> >
> > Could you please help.
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Kazmi,Auon H <[email protected]>
> > Sent: Friday, November 25, 2016 3:23:46 PM
> > To: [email protected]
> > Subject: Re: Adding KNN to madlib
> >
> > Thanks NJ,
> >
> > I will move forward in the suggested way.
> >
> >
> >
> >
> > Regards,
> >
> > Auon
> >
> > ________________________________
> > From: Nandish Jayaram <[email protected]>
> > Sent: Wednesday, November 23, 2016 12:20:35 PM
> > To: [email protected]
> > Subject: Re: Adding KNN to madlib
> >
> > Hey Auon,
> >
> > Starting with only classification for now sounds like a good idea!
> > Yes, the output should be just the predicted label for each row.
> > If the table you want to run the classification task on is like the
> > following:
> >  id |  x |    y
> > ----+----+------
> >   1 | 10 | 10.5
> >   2 | 30 | 31.5
> >   3 | 20 | 22.5
> >
> > then the output table could be something like the following:
> >  id |  x |    y | predicted_label
> > ----+----+------+-----------------
> >   1 | 10 | 10.5 | true
> >   2 | 30 | 31.5 | false
> >   3 | 20 | 22.5 | true
> >
> > You are basically adding a new column called "predicted_label" to the
> > input table, and assigning the label for each row based on k-NN.
> >
> > We can certainly make it better, by modifying the kNN function interface.
> > But let's just keep it simple for now and work on that later.
> >
> > NJ
> >
> > On Tue, Nov 22, 2016 at 2:52 PM, Kazmi,Auon H <[email protected]> wrote:
> >
> > >
> > > Hi NJ,
> > >
> > > I have implemented a first version of the interface as suggested by you.
> > > Right now, I am just looking at the classification task. I will generalize
> > > it to work for the regression task as well. I have a question regarding
> > > the output of the function. Should it just be the predicted label (or
> > > prediction value in case of regression)? Can you give an example of the
> > > output?
> > >
> > >
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Kazmi,Auon H <[email protected]>
> > > Sent: Friday, November 18, 2016 3:16:00 AM
> > > To: [email protected]
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi NJ,
> > >
> > > Thanks for your inputs!
> > >
> > > I will go through everyone of them and try to incorporate them.
> > >
> > >
> > >
> > > Regards,
> > >
> > > Auon Haidar
> > >
> > > ________________________________
> > > From: Nandish Jayaram <[email protected]>
> > > Sent: Wednesday, November 16, 2016 2:29:05 PM
> > > To: [email protected]
> > > Subject: Re: Adding KNN to madlib
> > >
> > > Hi Auon,
> > >
> > > Defining the interface is a good start for k-NN. I have slightly modified
> > > your interface to help it conform with other MADlib algorithms' interfaces.
> > > Note that the output for each new data point is not the 'k' nearest
> > > neighbors, but either a classification or regression task on the data point
> > > based on its 'k' nearest neighbors. Every data point in the training data
> > > will have an associated class label (regression value) in a different
> > > column. Normally, the column containing the data point itself is called the
> > > independent variable, and the column containing the class label is called
> > > the dependent variable. If it is classification, you take a majority vote
> > > of the class labels of the 'k' nearest neighbors, and if it is regression,
> > > you average the dependent variable values of the 'k' nearest neighbors.
> > > Here is a preliminary interface we could start with:
> > >
> > > *knn*(
> > > source_table, -- *TEXT, name of table containing training data.*
> > > new_data_table, -- *TEXT, name of table containing new data on which
> > > classification or regression has to be performed. Classification or
> > > regression can be performed based on the type of "dependent_varname".*
> > > output_table, -- *TEXT, name of the table where output predictors are
> > > written. If this table is already present, an error is returned.*
> > > dependent_varname, -- *TEXT, name of the dependent variable column. If
> > > this column is of type boolean/integer, we could probably perform k-NN
> > > classification, and perform k-NN regression if this is of type double.*
> > > independent_varname, -- *TEXT, column defining data points. Data points can
> > > be of type SVEC or any type convertible to SVEC such as float[] or
> > > integer[].*
> > > k, -- *INTEGER, (optional, default value could be some odd number, say 5)
> > > number of neighbors to consider.*
> > > metric, -- *TEXT, (optional, default value could be what you are using now
> > > for distance) the distance metric to use.*
> > > );
> > >
> > > For now you can just use the distance metric you had mentioned in an
> > > earlier email. Note that the source_table and new_data_table are tables in
> > > the database and not files.
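The rule described above (majority vote over the 'k' nearest neighbors for classification, average of their dependent values for regression) can be sketched in a few lines of plain Python. This is only an illustration under assumptions, not MADlib's implementation: the names knn_predict and squared_dist and the list-of-pairs layout for the training data are made up for the example, and the type check stands in for the suggestion to choose the task from the type of dependent_varname.

```python
from collections import Counter
from statistics import mean


def squared_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, point, k=5, metric=squared_dist):
    """Predict a label or value for `point` from its k nearest training rows.

    `train` is a list of (vector, dependent_value) pairs (assumed layout).
    """
    # Sort the training rows by distance to the new point, keep the closest k.
    nearest = sorted(train, key=lambda row: metric(row[0], point))[:k]
    values = [dep for _, dep in nearest]
    # Float dependent values: regression, so average the neighbors' values.
    if any(isinstance(v, float) for v in values):
        return mean(values)
    # Boolean/integer dependent values: classification by majority vote.
    return Counter(values).most_common(1)[0][0]


# Toy training data, invented for this example.
train = [([10.0, 10.5], True), ([30.0, 31.5], False), ([20.0, 22.5], True)]
print(knn_predict(train, [12.0, 12.0], k=3))  # True
```

With two classes, an odd k cannot produce a tied vote, which fits the suggestion of an odd default such as 5. The metric argument is a plain callable here so that another distance function can be swapped in.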
> > >
> > > Some pointers to help you start off with the implementation:
> > > - https://cwiki.apache.org/confluence/display/MADLIB/Quick+Start+Guide+for+Developers
> > > is a very useful resource with a great hello-world example. It gives you
> > > details about how to add a new module (k-NN would be a new module) to
> > > MADlib.
> > > - k-NN is a great candidate for parallelizing. Do try to use UDA (User
> > > Defined Aggregates) in your implementation. This will require you to add a
> > > C++ layer too, along with the SQL and python layers. Feel free to ask
> > > specific questions about this after you have tried out the hello world
> > > example.
> > > - Chapter 1 in http://madlib.incubator.apache.org/design.pdf gives you
> > > more
> > > information regarding the C++ abstraction layer in MADlib.
> > >
> > > Feel free to shout out for help if you are stuck! Cheers. :)
> > >
> > > NJ
> > >
> > > On Tue, Nov 15, 2016 at 2:56 PM, Kazmi,Auon H <[email protected]> wrote:
> > >
> > > > Hi Frank and NJ,
> > > >
> > > > Thanks for your comments. I will go through the suggestions provided by
> > > > NJ.
> > > >
> > > > Current interface of KNN is as follows:
> > > >
> > > > 1) Input:
> > > >
> > > >        - Name of table having all the data points in n-dimensional
> > > >          vector form (Double Precision[ ])
> > > >
> > > >        - Column-name of these data points
> > > >
> > > >        - Name of file having that n-dim vector (v, say) whose k-nearest
> > > >          neighbours need to be found from first table (Double
> > > >          Precision[ ])
> > > >
> > > >        - Column name having this vector
> > > >
> > > >        - value of 'k'
> > > >
> > > >
> > > > It returns 'k' nearest neighbours of vector v from first table having
> > > > data points.
> > > >
> > > >
> > > >
> > > > For now, I am using madlib's squared norm function to calculate distance
> > > > between any two vectors. I will try to generalise that.
> > > >
> > > >
> > > > Please suggest any other improvements.
> > > >
> > > >
> > > >
> > > > Thanks,
> > > >
> > > > Auon Haidar
> > > >
> > > > ________________________________
> > > > From: Frank McQuillan <[email protected]>
> > > > Sent: Tuesday, November 15, 2016 1:30:53 PM
> > > > To: [email protected]
> > > > Subject: Re: Adding KNN to madlib
> > > >
> > > > Auon,
> > > >
> > > > Thanks for working on kNN for MADlib. Can you expand a little bit on your
> > > > note, and post the interface that you are thinking about and a
> > > > description of the arguments? Then people can comment on that.
> > > >
> > > > Thanks,
> > > > Frank
> > > >
> > > > On Tue, Nov 15, 2016 at 9:30 AM, Nandish Jayaram <[email protected]> wrote:
> > > >
> > > > > Hi Auon,
> > > > >
> > > > > Great going with your first version of k-NN implementation.
> > > > > Some useful links for coding guidelines are at (see Developer
> > > > > Documentation):
> > > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=61319606
> > > > > MADlib has something called install-checks for basic testing. You can
> > > > > look at any existing module for an example of the same. For instance,
> > > > > check out the install check code for k-means at:
> > > > > https://github.com/apache/incubator-madlib/tree/master/src/ports/postgres/modules/kmeans/test
> > > > >
> > > > > I am sure others will pitch in to help you more with your other
> > > > > questions, but these are some starters you can consider! Good luck!
> > > > >
> > > > > NJ
> > > > >
> > > > > On Mon, Nov 14, 2016 at 10:41 PM, Kazmi,Auon H <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I am a first year Computer Science graduate student at University of
> > > > > > Florida working on implementing KNN in Madlib. I am ready with a first
> > > > > > version of it but I don't know how to proceed with testing and adding
> > > > > > it to the Madlib platform. Also, I am not clear on what standards I
> > > > > > have to follow in the final implementation. My current version asks
> > > > > > for the table name and column name having vectors in which I have to
> > > > > > find the neighbours. The other table given as input holds the vector
> > > > > > whose K-NN needs to be found. It is assuming the Euclidean distance
> > > > > > metric for distance calculation. It would really help if somebody can
> > > > > > share ideas on what can be added to this functionality.
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Regards,
> > > > > >
> > > > > > Auon Haidar Kazmi
> > > > > >
> > > > >
> > > >
> > >
> >
>
