[ 
https://issues.apache.org/jira/browse/MADLIB-927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15772958#comment-15772958
 ] 

ASF GitHub Bot commented on MADLIB-927:
---------------------------------------

Github user auonhaidar commented on a diff in the pull request:

    https://github.com/apache/incubator-madlib/pull/81#discussion_r93768540
  
    --- Diff: src/ports/postgres/modules/knn/knn.sql_in ---
    @@ -0,0 +1,165 @@
    +/* ----------------------------------------------------------------------- 
*//**
    + *
    + * @file knn.sql_in
    + *
    + * @brief Set of functions for k-nearest neighbors.
    + *
    + *
    + *//* 
----------------------------------------------------------------------- */
    +
    +m4_include(`SQLCommon.m4')
    +
    +DROP TYPE IF EXISTS MADLIB_SCHEMA.knn_result CASCADE;
    +CREATE TYPE MADLIB_SCHEMA.knn_result AS (
    +    prediction float
    +);
    +DROP TYPE IF EXISTS MADLIB_SCHEMA.test_table_spec CASCADE;
    +CREATE TYPE MADLIB_SCHEMA.test_table_spec AS (
    +    id integer,
    +    vector DOUBLE PRECISION[]
    +);
    +
    +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.__knn_validate_src(
    +rel_source VARCHAR
    +) RETURNS VOID AS $$
    +    PythonFunction(knn, knn, knn_validate_src)
    +$$ LANGUAGE plpythonu
    +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
    +
    +
    +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
    +    arg1 VARCHAR
    +) RETURNS VOID AS $$
    +BEGIN
    +    IF arg1 = 'help' THEN
    +   RAISE NOTICE 'You need to enter following arguments in order:
    +   Argument 1: Training data table having training features as vector 
column and labels
    +   Argument 2: Name of column having feature vectors in training data table
    +   Argument 3: Name of column having actual label/vlaue for corresponding 
feature vector in training data table
    +   Argument 4: Test data table having features as vector column. Id of 
features is mandatory
    +   Argument 5: Name of column having feature vectors in test data table
    +   Argument 6: Name of column having feature vector Ids in test data table
    +   Argument 7: Name of output table
    +   Argument 8: c for classification task, r for regression task
    +   Argument 9: value of k. Default will go as 1';
    +    END IF;
    +END;
    +$$ LANGUAGE plpgsql VOLATILE
    +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
    +
    +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
    +) RETURNS VOID AS $$
    +BEGIN    
    +    EXECUTE $sql$ select * from MADLIB_SCHEMA.knn('help') $sql$;
    +END;
    +$$ LANGUAGE plpgsql VOLATILE
    +m4_ifdef(`__HAS_FUNCTION_PROPERTIES__', `READS SQL DATA', `');
    +
    +
    +CREATE OR REPLACE FUNCTION MADLIB_SCHEMA.knn(
    +    point_source VARCHAR,
    +    point_column_name VARCHAR,
    +    label_column_name VARCHAR,
    +    test_source VARCHAR,
    +    test_column_name VARCHAR,
    +    id_column_name VARCHAR,
    +    output_table VARCHAR,
    +    operation VARCHAR,
    +    k INTEGER
    +) RETURNS VARCHAR AS $$
    +DECLARE
    +    class_test_source REGCLASS;
    +    class_point_source REGCLASS;
    +    l FLOAT;
    +    id INTEGER;
    +    vector DOUBLE PRECISION[];
    +    cur_pid integer;
    +    theResult MADLIB_SCHEMA.knn_result;
    +    r MADLIB_SCHEMA.test_table_spec;
    +    oldClientMinMessages VARCHAR;
    +    returnstring VARCHAR;
    +BEGIN
    +    oldClientMinMessages :=
    +        (SELECT setting FROM pg_settings WHERE name = 
'client_min_messages');
    +    EXECUTE 'SET client_min_messages TO warning';
    +    PERFORM MADLIB_SCHEMA.__knn_validate_src(test_source);
    +    PERFORM MADLIB_SCHEMA.__knn_validate_src(point_source);
    +    class_test_source := test_source;
    +    class_point_source := point_source;
    +    --checks
    +    IF (k <= 0) THEN
    +        RAISE EXCEPTION 'KNN error: Number of neighbors k must be a 
positive integer.';
    +    END IF;
    +    IF (operation != 'c' AND operation != 'r') THEN
    +        RAISE EXCEPTION 'KNN error: put r for regression OR c for 
classification.';
    +    END IF;
    +    PERFORM MADLIB_SCHEMA.create_schema_pg_temp();
    +    
    +    EXECUTE format('DROP TABLE IF EXISTS %I',output_table);
    +    EXECUTE format('CREATE TABLE %I(%I integer, %I DOUBLE PRECISION[], 
predlabel float)',output_table,id_column_name,test_column_name);
    +   
    +    
    +    FOR r IN EXECUTE format('SELECT %I,%I FROM %I', id_column_name, 
test_column_name, test_source)
    +    LOOP
    +   cur_pid := r.id;
    +   vector := r.vector;
    +   EXECUTE
    +        $sql$
    +   DROP TABLE IF EXISTS pg_temp.knn_vector;
    --- End diff --
    
    Please elaborate on this.


> Initial implementation of k-NN
> ------------------------------
>
>                 Key: MADLIB-927
>                 URL: https://issues.apache.org/jira/browse/MADLIB-927
>             Project: Apache MADlib
>          Issue Type: New Feature
>            Reporter: Rahul Iyer
>              Labels: gsoc2016, starter
>
> k-Nearest Neighbors is a simple algorithm based on finding nearest neighbors 
> of data points in a metric feature space according to a specified distance 
> function. It is considered one of the canonical algorithms of data science. 
> It is a nonparametric method, which makes it applicable to a lot of 
> real-world problems where the data doesn’t satisfy particular distribution 
> assumptions. It can also be implemented as a lazy algorithm, which means 
> there is no training phase where information in the data is condensed into 
> coefficients, but there is a costly testing phase where all data (or some 
> subset) is used to make predictions.
> This JIRA involves implementing the naïve approach - i.e. compute the k 
> nearest neighbors by going through all points.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to