[
https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15768621#comment-15768621
]
ASF GitHub Bot commented on MADLIB-927:
---------------------------------------
Github user orhankislal commented on a diff in the pull request:
https://github.com/apache/incubator-madlib/pull/81#discussion_r93346785
--- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
@@ -0,0 +1,165 @@
+/* ----------------------------------------------------------------------- *//**
+ *
+ * @file knn.sql_in
+ *
+ * @brief Set of functions for k-nearest neighbors.
+ *
+ *
+ *//* ----------------------------------------------------------------------- */
+
+m4_include(`SQLCommon.m4')
+
+DROP TYPE IF EXISTS MADLIB_SCHEMA.knn_result CASCADE;
+CREATE TYPE MADLIB_SCHEMA.knn_result AS (
+ prediction float
+);
+DROP TYPE IF EXISTS MADLIB_SCHEMA.test_table_spec CASCADE;
+CREATE TYPE MADLIB_SCHEMA.test_table_spec AS (
+ id integer,
+ vector DOUBLE PRECISION[]
+);
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_src(
+rel_source VARCHAR
+) RETURNS VOID AS $$
+ PythonFunction(knn, knn, knn_validate_src)
+$$ LANGUAGE plpythonu
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+ arg1 VARCHAR
+) RETURNS VOID AS $$
+BEGIN
+ IF arg1 = 'help' THEN
+ RAISE NOTICE 'You need to enter following arguments in order:
+ Argument 1: Training data table having training features as vector column and labels
+ Argument 2: Name of column having feature vectors in training data table
+ Argument 3: Name of column having actual label/value for corresponding feature vector in training data table
+ Argument 4: Test data table having features as vector column. Id of features is mandatory
+ Argument 5: Name of column having feature vectors in test data table
+ Argument 6: Name of column having feature vector Ids in test data table
+ Argument 7: Name of output table
+ Argument 8: c for classification task, r for regression task
+ Argument 9: value of k. Default will go as 1';
+ END IF;
+END;
+$$ LANGUAGE plpgsql VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+) RETURNS VOID AS $$
+BEGIN
+ EXECUTE $sql$ select * from MADLIB_SCHEMA.knn('help') $sql$;
+END;
+$$ LANGUAGE plpgsql VOLATILE
+m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
+
+
+CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
+ point_source VARCHAR,
+ point_column_name VARCHAR,
+ label_column_name VARCHAR,
+ test_source VARCHAR,
+ test_column_name VARCHAR,
+ id_column_name VARCHAR,
+ output_table VARCHAR,
+ operation VARCHAR,
+ k INTEGER
+) RETURNS VARCHAR AS $$
+DECLARE
+ class_test_source REGCLASS;
+ class_point_source REGCLASS;
+ l FLOAT;
+ id INTEGER;
+ vector DOUBLE PRECISION[];
+ cur_pid integer;
+ theResult MADLIB_SCHEMA.knn_result;
+ r MADLIB_SCHEMA.test_table_spec;
+ oldClientMinMessages VARCHAR;
+ returnstring VARCHAR;
+BEGIN
+ oldClientMinMessages :=
+ (SELECT setting FROM pg_settings WHERE name = 'client_min_messages');
+ EXECUTE 'SET client_min_messages TO warning';
+ PERFORM MADLIB_SCHEMA.__knn_validate_src(test_source);
+ PERFORM MADLIB_SCHEMA.__knn_validate_src(point_source);
+ class_test_source := test_source;
+ class_point_source := point_source;
+ --checks
+ IF (k <= 0) THEN
+ RAISE EXCEPTION 'KNN error: Number of neighbors k must be a positive integer.';
+ END IF;
+ IF (operation != 'c' AND operation != 'r') THEN
+ RAISE EXCEPTION 'KNN error: put r for regression OR c for classification.';
+ END IF;
+ PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
+
+ EXECUTE format('DROP TABLE IF EXISTS %I',output_table);
+ EXECUTE format('CREATE TABLE %I(%I integer, %I DOUBLE PRECISION[], predlabel float)',output_table,id_column_name,test_column_name);
+
+
+ FOR r IN EXECUTE format('SELECT %I,%I FROM %I', id_column_name, test_column_name, test_source)
+ LOOP
+ cur_pid := r.id;
+ vector := r.vector;
+ EXECUTE
+ $sql$
+ DROP TABLE IF EXISTS pg_temp.knn_vector;
--- End diff --
If we want to use a specific table name (and not the unique_string function
that MADlib has), we should add a MADlib prefix (for example,
__madlib__knn_vector__) to make sure the name is not already used by someone
else.
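
As a rough illustration of the naming concern, the sketch below builds a
temporary-table name that carries a `__madlib__` prefix plus a random suffix.
The helper `unique_table_name` is hypothetical and is not MADlib's
`unique_string`, whose signature is not shown in this thread. It only
demonstrates the idea of namespacing and randomizing the name.

```python
import uuid


def unique_table_name(base, prefix="__madlib__"):
    """Return a name such as __madlib__knn_vector_<hex>__.

    Hypothetical helper for illustration only. The prefix marks the table
    as internal, and the random suffix makes a clash with a user table or
    with a concurrent call unlikely.
    """
    suffix = uuid.uuid4().hex[:12]  # 12 random hexadecimal characters
    return "{0}{1}_{2}__".format(prefix, base, suffix)


name = unique_table_name("knn_vector")
print(name)
```

The generated name stays well under PostgreSQL's default 63-character
identifier limit, so it can be passed to `format('%I', ...)` unchanged.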
> Initial implementation of k-NN
> ------------------------------
>
> Key: MADLIB-927
> URL: https://issues.apache.org/jira/browse/MADLIB-927
> Project: Apache MADlib
> Issue Type: New Feature
> Reporter: Rahul Iyer
> Labels: gsoc2016, starter
>
> k-Nearest Neighbors is a simple algorithm based on finding nearest neighbors
> of data points in a metric feature space according to a specified distance
> function. It is considered one of the canonical algorithms of data science.
> It is a nonparametric method, which makes it applicable to a lot of
> real-world problems where the data doesn’t satisfy particular distribution
> assumptions. It can also be implemented as a lazy algorithm, which means
> there is no training phase where information in the data is condensed into
> coefficients, but there is a costly testing phase where all data (or some
> subset) is used to make predictions.
> This JIRA involves implementing the naïve approach - i.e. compute the k
> nearest neighbors by going through all points.
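
The naive approach described above can be sketched as follows. This is a
minimal in-memory illustration, not the code in the pull request: the function
names are hypothetical, Euclidean distance is an assumed choice of distance
function, and the `'c'` / `'r'` switch simply mirrors the operation argument in
the diff.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k=1, operation="c"):
    """Predict a label for `query` from `train`, a list of (vector, label).

    operation is 'c' for classification (majority vote) or 'r' for
    regression (mean of the neighbors' labels).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    if operation not in ("c", "r"):
        raise ValueError("operation must be 'c' or 'r'")
    # Naive approach: measure the distance to every training point,
    # then keep the k closest ones.
    nearest = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    labels = [label for _, label in nearest]
    if operation == "r":
        return sum(labels) / float(len(labels))
    return Counter(labels).most_common(1)[0][0]


train = [([0.0, 0.0], 0), ([0.0, 1.0], 0), ([5.0, 5.0], 1), ([6.0, 5.0], 1)]
print(knn_predict(train, [0.5, 0.5], k=3, operation="c"))
print(knn_predict(train, [0.5, 0.5], k=3, operation="r"))
```

Every prediction scans the whole training set, which is the costly testing
phase the description mentions.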
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)