[ https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836850#comment-15836850 ]
ASF GitHub Bot commented on MADLIB-927:
---------------------------------------
Github user njayaram2 commented on a diff in the pull request:
https://github.com/apache/incubator-madlib/pull/81#discussion_r97676430
--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -0,0 +1,440 @@
+/* ----------------------------------------------------------------------- *//**
+ *
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ *//* ----------------------------------------------------------------------- */
+
+
+/* ----------------------------------------------------------------------- *//**
+ *
+ * @file knn.sql_in
+ * @brief Set of functions for k-nearest neighbors.
+ * @sa For a brief introduction to the k-nearest neighbors algorithm for
+ *     regression and classification, see the module description \ref grp_knn.
+ *
+ *
+ *//* ----------------------------------------------------------------------- */
+
+m4_include(`SQLCommon.m4')
+
+
+/**
+@addtogroup grp_knn
+
+<div class="toc"><b>Contents</b>
+<ul>
+<li class="level1"><a href="#knnfunction">KNN Function</a></li>
+<li class="level1"><a href="#output">Output Format</a></li>
+<li class="level1"><a href="#examples">Examples</a></li>
+<li class="level1"><a href="#background">Technical Background</a></li>
+<li class="level1"><a href="#literature">Literature</a></li>
+<li class="level1"><a href="#related">Related Topics</a></li>
+</ul>
+</div>
+
+@brief Finds the k nearest data points to each given data point and outputs the
+majority vote of the neighbors' classes for classification, or the average of
+the neighbors' target values for regression.
+
+k-Nearest Neighbors is a method for finding the \f$ k \f$ closest points to a
+given data point in terms of a given metric. Its input consists of
+data points as features from testing examples. For a given \f$ k \f$, it
+looks for the \f$ k \f$ closest points in the training set for each of the
+data points in the test set. The algorithm generates one output per testing
+example. The output of KNN depends on the type of task:
+for classification, the output is the majority vote of the classes of
+the \f$ k \f$ nearest data points, i.e. the testing example is assigned the
+most popular class among its nearest neighbors;
+for regression, the output is the average of the values of the \f$ k \f$
+nearest neighbors of the given testing example.
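+
+As an illustrative sketch only (not this module's implementation), once the
+\f$ k \f$ nearest labels for a single test point sit in a hypothetical table
+<tt>neighbors(label FLOAT)</tt>, the two outputs reduce to two aggregates
+(<tt>mode()</tt> is the ordered-set aggregate available in PostgreSQL 9.4+):
+<pre class="example">
+-- classification: majority vote of the k nearest labels
+SELECT mode() WITHIN GROUP (ORDER BY label) FROM neighbors;
+-- regression: average of the k nearest values
+SELECT avg(label) FROM neighbors;
+</pre>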
+
+@anchor knnfunction
+@par KNN Function
+
+The knn function can be invoked in three ways, depending on the
+number and values of arguments:
+
+- The method that explicitly passes the value of k.
+<pre class="syntax">
+knn( point_source,
+ point_column_name,
+ label_column_name,
+ test_source,
+ test_column_name,
+ id_column_name,
+ output_table,
+ operation,
+ k
+ )
+</pre>
+
+- The method that assumes k=1 when the ninth argument, k, is omitted.
+<pre class="syntax">
+knn( point_source,
+ point_column_name,
+ label_column_name,
+ test_source,
+ test_column_name,
+ id_column_name,
+ output_table,
+ operation
+ )
+</pre>
+
+- The method that describes the arguments required by the algorithm.
+<pre class="syntax">
+knn( 'help'
+ )
+</pre>
+
+\b Arguments
+<dl class="arglist">
+<dt>point_source</dt>
+<dd>TEXT. The name of the table containing the training data points.
+
+Training data points are expected to be stored row-wise,
+in a column of type <tt>DOUBLE PRECISION[]</tt>.
+</dd>
+
+<dt>point_column_name</dt>
+<dd>TEXT. The name of the column with training data points.</dd>
+
+<dt>label_column_name</dt>
+<dd>TEXT. The name of the column with labels/values of training data points.</dd>
+
+<dt>test_source</dt>
+<dd>TEXT. The name of the table containing the test data points.
+
+Testing data points are expected to be stored row-wise,
+in a column of type <tt>DOUBLE PRECISION[]</tt>.
+</dd>
+
+<dt>test_column_name</dt>
+<dd>TEXT. The name of the column with testing data points.</dd>
+
+<dt>id_column_name</dt>
+<dd>TEXT. Name of the column having ids of data points in the test data table.</dd>
+
+<dt>output_table</dt>
+<dd>TEXT. Name of the table to store final results.</dd>
+
+<dt>operation</dt>
+<dd>TEXT. The type of task: 'r' for regression and 'c' for classification.</dd>
+
+<dt>k (optional)</dt>
+<dd>INTEGER, default: 1. The number of nearest neighbors to consider.</dd>
+
+</dl>
+
+
+@anchor output
+@par Output Format
+
+The output of the KNN module is a table with the following columns:
+<table class="output">
+ <tr>
+ <th>id</th>
+ <td>INTEGER. The ids of test data points.</td>
+ </tr>
+ <tr>
+ <th>test_column_name</th>
+ <td>DOUBLE PRECISION[]. The test data points.</td>
+ </tr>
+ <tr>
+ <th>predLabel</th>
+    <td>FLOAT. The output of KNN: the predicted label for classification,
+    or the average of the neighbors' values for regression.</td>
+ </tr>
+</table>
+
+
+@anchor examples
+@examp
+
+-# Prepare some training data.
+<pre class="example">
+CREATE TABLE knn_train_data (id integer, data integer[], label float);
+COPY knn_train_data (id, data, label) from stdin delimiter '|';
+1|{1,1}|1.0
+2|{2,2}|1.0
+3|{3,3}|1.0
+4|{4,4}|1.0
+5|{4,5}|1.0
+6|{20,50}|0.0
+7|{10,31}|0.0
+8|{81,13}|0.0
+9|{1,111}|0.0
+\\.
+</pre>
+
+-# Prepare some testing data.
+<pre class="example">
+CREATE TABLE knn_test_data (id integer, data integer[]);
+COPY knn_test_data (id, data) from stdin delimiter '|';
+1|{2,1}
+2|{2,6}
+3|{15,40}
+4|{12,1}
+5|{2,90}
+6|{50,45}
+\\.
+</pre>
+
+-# Run KNN for classification:
+<pre class="example">
+SELECT * FROM madlib.knn( 'knn_train_data',
+ 'data',
+ 'label',
+ 'knn_test_data',
+ 'data',
+ 'id',
+ 'madlib_knn_result_classification',
+ 'c',
+ 3
+ );
+</pre>
+Result:
+<pre class="result">
+SELECT * from madlib_knn_result_classification;
+
+ id | data | predLabel
+-----+----------+-----------
+ 1 | {2,1} | 1
+ 2 | {2,6} | 1
+ 3 | {15,40} | 0
+ 4 | {12,1} | 1
+ 5 | {2,90} | 0
+ 6 | {50,45} | 0
+
+</pre>
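+For test point 1 ({2,1}), for instance, the three nearest training points are
+{1,1}, {2,2} and {3,3}, all labeled 1.0, so the majority vote is 1.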
+
+-# Run KNN for regression:
+<pre class="example">
+SELECT * FROM madlib.knn( 'knn_train_data',
+ 'data',
+ 'label',
+ 'knn_test_data',
+ 'data',
+ 'id',
+ 'madlib_knn_result_regression',
+ 'r',
+ 3
+ );
+</pre>
+Result:
+<pre class="result">
+SELECT * from madlib_knn_result_regression;
+
+ id | data | predLabel
+-----+----------+-----------
+ 1 | {2,1} | 1
+ 2 | {2,6} | 1
+ 3 | {15,40} | 0.5
+ 4 | {12,1} | 1
+ 5 | {2,90} | 0.25
+ 6 | {50,45} | 0.25
+
+</pre>
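+Here test point 1 again uses the neighbors {1,1}, {2,2} and {3,3}, and the
+average of their values (1.0, 1.0, 1.0) is 1.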
+
+
+
+@anchor background
+@par Technical Background
+
+The training data points are vectors in a multidimensional feature space,
+each with a class label. The training phase of the algorithm consists
+only of storing the feature vectors and class labels of the training points.
+
+In the classification phase, k is a user-defined constant, and an unlabeled
+vector (a test point) is classified by assigning the label which is most
+frequent among the k training samples nearest to that test point.
+In the case of regression, the average of the values of these k training
+samples is assigned to the test point.
+
+A commonly used distance metric for continuous variables is Euclidean distance.
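+
+For illustration only (not this module's implementation), a minimal
+brute-force search for one test point, say {2,1}, against a hypothetical
+training table <tt>train(pt DOUBLE PRECISION[], label FLOAT)</tt> could be
+written as follows; the labels it returns are then aggregated by majority
+vote or average:
+<pre class="example">
+-- labels of the k=3 nearest training points by squared Euclidean distance
+SELECT label
+FROM train
+ORDER BY madlib.squared_dist_norm2(pt, ARRAY[2,1]::DOUBLE PRECISION[])
+LIMIT 3;
+</pre>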
+
+
+
+@anchor literature
+@literature
+
+@anchor knn-lit-1
+[1] Wikipedia, k-nearest neighbors algorithm,
+    https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
+
+@anchor knn-lit-2
+[2] N. S. Altman: An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression,
+    http://www.stat.washington.edu/courses/stat527/s13/readings/Altman_AmStat_1992.pdf
+
+@anchor knn-lit-3
+[3] Gongde Guo, Hui Wang, David Bell, Yaxin Bi, Kieran Greer: KNN Model-Based Approach in Classification,
+    https://ai2-s2-pdfs.s3.amazonaws.com/a7e2/814ec5db800d2f8c4313fd436e9cf8273821.pdf
+
+
+@anchor related
+@par Related Topics
+
+File knn.sql_in documenting the k-NN SQL functions
+
+
+
+
+@internal
+@sa namespace knn (documenting the implementation in Python)
+@endinternal
+*/
+
+
+
+
+
+
+
+
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_src(
+rel_source VARCHAR
+) RETURNS VOID AS $$
+ PythonFunction(knn, knn, knn_validate_src)
+$$ LANGUAGE plpythonu
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+ arg1 VARCHAR
+) RETURNS VOID AS $$
+BEGIN
+ IF arg1 = 'help' THEN
+    RAISE NOTICE 'You need to enter the following arguments in order:
+    Argument 1: Training data table having training features as vector column and labels
+    Argument 2: Name of column having feature vectors in training data table
+    Argument 3: Name of column having actual label/value for corresponding feature vector in training data table
+    Argument 4: Test data table having features as vector column. Id of features is mandatory
+    Argument 5: Name of column having feature vectors in test data table
+    Argument 6: Name of column having feature vector ids in test data table
+    Argument 7: Name of output table
+    Argument 8: c for classification task, r for regression task
+    Argument 9: value of k. Defaults to 1';
+ END IF;
+END;
+$$ LANGUAGE plpgsql VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+) RETURNS VOID AS $$
+BEGIN
+ EXECUTE $sql$ select * from MADLIB_SCHEMA.knn('help') $sql$;
+END;
+$$ LANGUAGE plpgsql VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+ point_source VARCHAR,
+ point_column_name VARCHAR,
+ label_column_name VARCHAR,
+ test_source VARCHAR,
+ test_column_name VARCHAR,
+ id_column_name VARCHAR,
+ output_table VARCHAR,
+ operation VARCHAR,
+ k INTEGER
+) RETURNS VARCHAR AS $$
+DECLARE
+ class_test_source REGCLASS;
+ class_point_source REGCLASS;
+ l FLOAT;
+ outputTableFlag INTEGER;
+ id INTEGER;
+ vector DOUBLE PRECISION[];
+ cur_pid integer;
+ oldClientMinMessages VARCHAR;
+ returnstring VARCHAR;
+BEGIN
+    oldClientMinMessages :=
+        (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
+ EXECUTE 'SET client_min_messages TO warning';
+ PERFORM MADLIB_SCHEMA.__knn_validate_src(test_source);
+ PERFORM MADLIB_SCHEMA.__knn_validate_src(point_source);
+ class_test_source := test_source;
+ class_point_source := point_source;
+ --checks
+    IF (k <= 0) THEN
+        RAISE EXCEPTION 'KNN error: Number of neighbors k must be a positive integer.';
+    END IF;
+    IF (operation != 'c' AND operation != 'r') THEN
+        RAISE EXCEPTION 'KNN error: use r for regression or c for classification.';
+    END IF;
+ PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
+
+    EXECUTE
+        $sql$
+        SELECT count(*) FROM information_schema.tables
+        WHERE table_name = '$sql$ || output_table || $sql$'$sql$ INTO outputTableFlag;
+    IF (outputTableFlag != 0) THEN
+        RAISE EXCEPTION '% table exists. Drop it or use a different name.', output_table;
+    END IF;
+    --EXECUTE format('DROP TABLE IF EXISTS %I', output_table);
+    --EXECUTE format('CREATE TABLE %I(%I integer, %I DOUBLE PRECISION[], predlabel float)', output_table, id_column_name, test_column_name);
+
+    EXECUTE
+        $sql$
+        DROP TABLE IF EXISTS pg_temp.madlib_knn_interm;
+        CREATE TABLE pg_temp.madlib_knn_interm AS
+        select * from ( select row_number() over (partition by test_id order by dist) as r, x.* from (select test.$sql$ || id_column_name || $sql$ as test_id, madlib.squared_dist_norm2(train.$sql$ || point_column_name || $sql$, test.$sql$ || test_column_name || $sql$) as dist, $sql$ || label_column_name || $sql$ from $sql$ || textin(regclassout(point_source)) || $sql$ as train, $sql$ || textin(regclassout(test_source)) || $sql$ as test)x)y where y.r <= $sql$ || k;
+    IF (operation = 'c') THEN
--- End diff --
I was playing around with knn a bit, and observed that using `x` and `y` in
`|| $sql$ as test)x)y where y.r` could cause some trouble if the columns
corresponding to `label_column_name`, `point_column_name` or `test_column_name`
are also named `x` or `y` (or if there is another column called `y` in
`test_source`). The query would then fail to execute due to ambiguity.
In general, when you use temporary names in a SQL query, it is best to use a
unique string instead of something like `x` or `y`. MADlib has a Python
function named `unique_string()` in the module `utilities.utilities` which will
generate a unique string for you. I would suggest applying it wherever you use
a temp name/placeholder in the SQL queries in knn.
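For illustration, here is a minimal self-contained sketch of the idea (the
`knn_alias_` names are hypothetical stand-ins for what `unique_string()` would
generate in the Python layer):
```sql
-- Sketch: splice generated aliases into the dynamic SQL instead of the
-- literal x/y, so user columns named x or y cannot collide with them.
DO $$
DECLARE
    a1 TEXT := 'knn_alias_' || md5(random()::text);  -- stands in for "x"
    a2 TEXT := 'knn_alias_' || md5(random()::text);  -- stands in for "y"
    n  INTEGER;
BEGIN
    EXECUTE 'SELECT count(*) FROM (SELECT row_number() OVER () AS r, '
         || a1 || '.* FROM (SELECT 1 AS x, 2 AS y) ' || a1 || ') ' || a2
         || ' WHERE ' || a2 || '.r <= 1'
    INTO n;
    RAISE NOTICE 'rows kept: %', n;  -- runs even though columns are named x/y
END
$$;
```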
> Initial implementation of k-NN
> ------------------------------
>
> Key: MADLIB-927
> URL: https://issues.apache.org/jira/browse/MADLIB-927
> Project: Apache MADlib
> Issue Type: New Feature
> Reporter: Rahul Iyer
> Labels: gsoc2016, starter
>
> k-Nearest Neighbors is a simple algorithm based on finding nearest neighbors
> of data points in a metric feature space according to a specified distance
> function. It is considered one of the canonical algorithms of data science.
> It is a nonparametric method, which makes it applicable to a lot of
> real-world problems where the data doesn’t satisfy particular distribution
> assumptions. It can also be implemented as a lazy algorithm, which means
> there is no training phase where information in the data is condensed into
> coefficients, but there is a costly testing phase where all data (or some
> subset) is used to make predictions.
> This JIRA involves implementing the naïve approach - i.e. compute the k
> nearest neighbors by going through all points.