[
https://issues.apache.org/jira/browse/MADLIB-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16141980#comment-16141980
]
Frank McQuillan commented on MADLIB-1129:
-----------------------------------------
Here is another crack at it:
{code}
knn( point_source,
point_column_name,
point_id,
label_column_name,
test_source,
test_column_name,
test_id,
output_table,
k, -- optional
output_neighbors -- optional
)
{code}
where
{code}
point_source
TEXT. Name of the table containing the training data points. Training data
points are expected to be stored row-wise in a column of type DOUBLE
PRECISION[].
point_column_name
TEXT. Name of the column with training data points.
point_id
TEXT. Name of the column in 'point_source' containing source data ids. The ids
are of type INTEGER with no duplicates. They do not need to be contiguous.
This parameter must be used if the list of nearest neighbors is to be output,
i.e., if the parameter 'output_neighbors' below is TRUE or if
'label_column_name' is NULL.
label_column_name
TEXT. Name of the column with labels/values of training data points. If the
column is of Boolean, integer or text type, knn classification is run; if it
is of double precision type, knn regression is run. If set to NULL, only the
neighbors are returned, without doing classification or regression.
test_source
TEXT. Name of the table containing the test data points. Testing data points
are expected to be stored row-wise in a column of type DOUBLE PRECISION[].
test_column_name
TEXT. Name of the column with testing data points.
test_id
TEXT. Name of the column in 'test_source' containing the ids of the test data
points.
output_table
TEXT. Name of the table to store final results.
k (optional)
INTEGER. default: 1. Number of nearest neighbors to consider. For
classification, k should be an odd number to break ties; otherwise the result
may depend on the ordering of the input data.
output_neighbors (optional)
BOOLEAN. default: FALSE. If TRUE, outputs the list of k-nearest neighbors that
were used in the voting/averaging.
{code}
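For concreteness, the input layout these descriptions imply might look like the following in plain PostgreSQL. This is only a sketch: the table and column names simply reuse the parameter names above as placeholders, the 'label' column name, the INTEGER type for test_id and the sample values are made up for illustration, and nothing here is part of MADlib itself.

```sql
-- Hypothetical training table: one point per row, features in an array column.
CREATE TABLE point_source (
    point_id          INTEGER PRIMARY KEY,   -- unique ids; gaps are allowed
    point_column_name DOUBLE PRECISION[],    -- feature vector of the point
    label             INTEGER                -- assumed name; integer type => classification
);

INSERT INTO point_source (point_id, point_column_name, label) VALUES
    (1, '{1.0, 1.0}', 0),
    (2, '{1.0, 2.0}', 0),
    (4, '{5.0, 5.0}', 1),
    (7, '{6.0, 5.0}', 1);

-- Hypothetical test table with the same array layout.
-- The proposal does not state the type of test_id; INTEGER is an assumption.
CREATE TABLE test_source (
    test_id          INTEGER PRIMARY KEY,
    test_column_name DOUBLE PRECISION[]
);

INSERT INTO test_source (test_id, test_column_name) VALUES
    (1, '{1.2, 1.0}'),
    (2, '{5.4, 5.0}');
```

With tables like these, 'label' would be passed as label_column_name for classification, or NULL to get neighbors only.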
So for Scott’s use case the SELECT statement would be:
{code}
SELECT * FROM madlib.knn(
'point_source',
'point_column_name',
'point_id',
NULL,
'test_source',
'test_column_name',
'test_id',
'output_table',
3,
TRUE
)
{code}
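To make the expected "neighbors only" result concrete, here is a brute-force sketch in plain SQL of what such a call would conceptually return: for each test point, the ids of the 3 closest training points, sorted by ascending distance. This is not how MADlib implements knn. The squared Euclidean distance, the inline sample data and the 'neighbors' output column name are all assumptions, and the query relies on LATERAL and multi-array unnest, so it assumes PostgreSQL 9.4 or later.

```sql
-- Inline sample data so the query runs on its own (values are illustrative).
WITH point_source (point_id, point_column_name) AS (
    VALUES (1, ARRAY[1.0, 1.0]::DOUBLE PRECISION[]),
           (2, ARRAY[1.0, 2.0]::DOUBLE PRECISION[]),
           (4, ARRAY[5.0, 5.0]::DOUBLE PRECISION[]),
           (7, ARRAY[6.0, 5.0]::DOUBLE PRECISION[])
),
test_source (test_id, test_column_name) AS (
    VALUES (1, ARRAY[1.2, 1.0]::DOUBLE PRECISION[]),
           (2, ARRAY[5.4, 5.0]::DOUBLE PRECISION[])
)
SELECT t.test_id,
       -- neighbor ids, closest first
       array_agg(n.point_id ORDER BY n.dist) AS neighbors
FROM test_source AS t
CROSS JOIN LATERAL (
    SELECT p.point_id,
           -- squared Euclidean distance (assumed distance function)
           (SELECT sum((a - b) ^ 2)
              FROM unnest(p.point_column_name, t.test_column_name) AS u(a, b)) AS dist
      FROM point_source AS p
     ORDER BY dist
     LIMIT 3                                  -- k = 3, as in the call above
) AS n
GROUP BY t.test_id
ORDER BY t.test_id;
```

Note that the training ids (1, 2, 4, 7) are not contiguous, which the point_id description explicitly allows.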
> Additional output information for k-NN
> --------------------------------------
>
> Key: MADLIB-1129
> URL: https://issues.apache.org/jira/browse/MADLIB-1129
> Project: Apache MADlib
> Issue Type: Improvement
> Components: k-NN
> Reporter: Frank McQuillan
> Assignee: Himanshu Pandey
> Priority: Minor
> Labels: starter
> Fix For: v2.0
>
>
> Follow on to
> https://issues.apache.org/jira/browse/MADLIB-927
> List the k-nearest neighbors that were used in the voting/averaging, sorted
> in ASC order according to the distance function used. This could be added to
> the current output table.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)