[jira] [Commented] (MADLIB-1129) Additional output information for k-NN

Frank McQuillan (JIRA) Fri, 25 Aug 2017 11:22:18 -0700

    [ 
https://issues.apache.org/jira/browse/MADLIB-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141980#comment-16141980
 ]


Frank McQuillan commented on MADLIB-1129:
-----------------------------------------


Here is another crack at it:

{code}
knn( point_source,
     point_column_name,
     point_id,
     label_column_name,
     test_source,
     test_column_name,
     test_id,
     output_table,
     k,                     -- optional
     output_neighbors       -- optional
   )
{code}
where
{code}
point_source
TEXT. Name of the table containing the training data points.  Training data 
points are expected to be stored row-wise in a column of type DOUBLE 
PRECISION[].

point_column_name
TEXT. Name of the column with training data points.

point_id
TEXT. Name of the column in 'point_source' containing source data ids. The ids 
are of type INTEGER with no duplicates. They do not need to be contiguous.  
This parameter must be used if the list of nearest neighbors is to be output, 
i.e., if the parameter 'output_neighbors' below is TRUE or if 
'label_column_name' is NULL.

label_column_name
TEXT. Name of the column with labels/values of training data points.  If the 
column is of Boolean, integer or text type, knn classification is run; if it 
is of double precision type, knn regression is run.  If you set this to NULL, 
only the neighbors are returned, without doing classification or regression.

test_source
TEXT. Name of the table containing the test data points.  Testing data points 
are expected to be stored row-wise in a column of type DOUBLE PRECISION[].

test_column_name
TEXT. Name of the column with testing data points.

test_id
TEXT. Name of the column having ids of data points in test data table.

output_table
TEXT. Name of the table to store final results.

k (optional)
INTEGER. default: 1. Number of nearest neighbors to consider. For 
classification, this should be an odd number to break ties; otherwise the 
result may depend on the ordering of the input data.

output_neighbors (optional)
BOOLEAN. default: FALSE. If TRUE, outputs the list of k-nearest neighbors that 
were used in the voting/averaging.
{code}
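
To make the three modes of 'label_column_name' concrete, here is a rough Python sketch of the intended semantics. It is not MADlib code: the function name knn_predict, the dict-based inputs, the Euclidean distance and the tie-breaking by id are all assumptions made for illustration only.

```python
from collections import Counter
import math


def knn_predict(points, labels, test_point, k=1):
    """Return (prediction, neighbor_ids) for a single test point.

    points: dict mapping id -> list of floats (stands in for point_source)
    labels: dict mapping id -> label/value, or None for neighbors only
    """
    # Rank training ids by Euclidean distance (assumed metric); ties on
    # distance are broken by id so the result is deterministic.
    ranked = sorted(points, key=lambda pid: (math.dist(points[pid], test_point), pid))
    neighbors = ranked[:k]

    # label_column_name = NULL: return the neighbors, no classification/regression.
    if labels is None:
        return None, neighbors

    values = [labels[pid] for pid in neighbors]
    # Double precision values: regression by averaging.
    if all(isinstance(v, float) for v in values):
        return sum(values) / len(values), neighbors
    # Boolean, integer or text values: classification by majority vote.
    return Counter(values).most_common(1)[0][0], neighbors


points = {10: [0.0, 0.0], 20: [1.0, 0.0], 35: [5.0, 5.0], 47: [0.0, 2.0]}
class_labels = {10: "a", 20: "b", 35: "b", 47: "a"}
value_labels = {10: 1.0, 20: 2.0, 35: 10.0, 47: 3.0}

classified = knn_predict(points, class_labels, [0.0, 0.0], k=3)
regressed = knn_predict(points, value_labels, [0.0, 0.0], k=3)
neighbors_only = knn_predict(points, None, [0.0, 0.0], k=3)
```

The ids here are deliberately non-contiguous, matching the 'point_id' description. With an even k and a tied vote, the majority-vote line would return whichever label it saw first, which is the ordering dependence the note on k warns about.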

So for Scott’s use case the SELECT statement would be:

{code}
SELECT * FROM madlib.knn( 
        'point_source',
        'point_column_name',
        'point_id',
        NULL,
        'test_source',
        'test_column_name',
        'test_id',
        'output_table',
        3,
        TRUE
   )
{code}
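
As a rough picture of what a neighbors-only call with k = 3 might produce, the Python sketch below builds one row per test point holding the ids of its three nearest training points, nearest first. The row layout (test id plus a list of neighbor ids), the function name nearest_neighbor_rows and the Euclidean distance are assumptions for illustration; the actual output table columns are not specified in this proposal.

```python
import math


def nearest_neighbor_rows(train_rows, test_rows, k=3):
    """Return [(test_id, [neighbor ids])], neighbors in ascending distance order.

    train_rows and test_rows are lists of (id, vector) tuples, standing in
    for the id and DOUBLE PRECISION[] columns of the two input tables.
    """
    rows = []
    for test_id, test_vec in test_rows:
        # Sort by distance to this test point; break exact ties by id.
        ranked = sorted(train_rows, key=lambda r: (math.dist(r[1], test_vec), r[0]))
        rows.append((test_id, [pid for pid, _ in ranked[:k]]))
    return rows


train_rows = [(1, [1.0, 1.0]), (2, [2.0, 2.0]), (5, [8.0, 8.0]), (9, [9.0, 9.0])]
test_rows = [(100, [1.2, 1.2]), (101, [8.4, 8.4])]

result = nearest_neighbor_rows(train_rows, test_rows, k=3)
for row in result:
    print(row)
```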



> Additional output information for k-NN
> --------------------------------------
>
>                 Key: MADLIB-1129
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1129
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: k-NN
>            Reporter: Frank McQuillan
>            Assignee: Himanshu Pandey
>            Priority: Minor
>              Labels: starter
>             Fix For: v2.0
>
>
> Follow on to
> https://issues.apache.org/jira/browse/MADLIB-927
> List the k-nearest neighbors that were used in the voting/averaging, sorted 
> in ASC order according to the distance function used.  This could be added to 
> the current output table.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
