[jira] [Comment Edited] (MADLIB-1129) Additional output information for k-NN

Frank McQuillan (JIRA) Wed, 23 Aug 2017 14:53:26 -0700

    [ 
https://issues.apache.org/jira/browse/MADLIB-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139202#comment-16139202
 ]


Frank McQuillan edited comment on MADLIB-1129 at 8/23/17 9:52 PM:
------------------------------------------------------------------

I suggest that the list of k-nearest neighbors used in the voting/averaging 
be an array of INTEGERs holding the ids of the corresponding rows in the 
point_source table.

Current output:
{code}
 id |  data   | prediction 
----+---------+------------
  1 | {2,1}   |          1
  2 | {2,6}   |          1
  3 | {15,40} |          0
  4 | {12,1}  |          1
  5 | {2,90}  |          0
  6 | {50,45} |          0
(6 rows)
{code}

New proposed output for k=4 (with dummy data):
{code}
 id |  data   | prediction | k_nearest_neighbors
----+---------+------------+---------------------
  1 | {2,1}   |          1 | {2,8,6,3}
  2 | {2,6}   |          1 | {2,1,3,4}
  3 | {15,40} |          0 | {1,2,6,6}
  4 | {12,1}  |          1 | {2,8,6,3}
  5 | {2,90}  |          0 | {7,5,4,3}
  6 | {50,45} |          0 | {2,8,6,3}
(6 rows)
{code}
where the ids in k_nearest_neighbors are sorted in ascending order of 
distance, from closest to furthest, according to the distance measure used.
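
To make the ordering concrete, here is a minimal Python sketch (not MADlib's 
implementation) of how the k_nearest_neighbors array could be built for a 
single test point. It assumes squared Euclidean distance and breaks ties by 
the smaller id; the function name k_nearest_ids and the sample points are 
made up for illustration.

```python
import heapq

def k_nearest_ids(points, test_point, k):
    """Return the ids of the k points closest to test_point, closest first.

    points: dict mapping integer id -> coordinate list (a hypothetical
    stand-in for the point_source table). Squared Euclidean distance is an
    assumption; ties are broken by the smaller id.
    """
    def sq_dist(p):
        return sum((a - b) ** 2 for a, b in zip(p, test_point))

    ranked = heapq.nsmallest(
        k, points.items(), key=lambda kv: (sq_dist(kv[1]), kv[0])
    )
    return [pid for pid, _ in ranked]

# Illustrative data only; ids need not be contiguous.
points = {1: [1, 1], 3: [2, 2], 4: [10, 10], 8: [5, 1]}
neighbors = k_nearest_ids(points, [2, 1], k=3)
print(neighbors)
```

The ids come back ordered by distance rather than by id value, which is the 
ordering the proposed column would carry.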

This means we need to change the interface:

The current interface is:
{code}
knn( point_source,
     point_column_name,
     label_column_name,
     test_source,
     test_column_name,
     id_column_name,
     output_table,
     operation,
     k
   )
{code}

The new proposed interface is:
{code}
knn( point_source,
     point_column_name,
     point_id,
     label_column_name,
     test_source,
     test_column_name,
     test_id,
     output_table,
     operation,
     k,
     output_neighbors
   )
{code}

{code}
point_id
TEXT, default = 'id'. Name of the column in 'point_source' containing source 
data ids. The ids are of type INTEGER with no duplicates. They do not need to 
be contiguous.
{code}
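
The constraints on point_id (INTEGER, no duplicates, gaps allowed) could be 
checked along these lines. This is only a sketch in Python; the helper name 
check_point_ids is hypothetical and is not part of MADlib.

```python
def check_point_ids(ids):
    """Raise ValueError unless ids are integers with no duplicates.

    Gaps are allowed, so [1, 3, 8] is acceptable. This helper is hypothetical
    and only mirrors the constraints stated for point_id.
    """
    seen = set()
    for i in ids:
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(i, int) or isinstance(i, bool):
            raise ValueError(f"id {i!r} is not an INTEGER")
        if i in seen:
            raise ValueError(f"duplicate id {i}")
        seen.add(i)
    return True

check_point_ids([1, 3, 8])  # non-contiguous ids pass
```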

{code}
output_neighbors (optional)
BOOLEAN, default = FALSE. If TRUE, outputs the list of k-nearest neighbors 
that were used in the voting/averaging.
{code}
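
As a rough illustration of the flag's effect, the Python sketch below builds 
an output row with and without the extra column. The function 
build_output_row and the dict-based row are assumptions for illustration 
only; the column names follow the proposed output table above.

```python
def build_output_row(test_id, data, prediction, neighbor_ids,
                     output_neighbors=False):
    """Assemble one output row as a dict (hypothetical stand-in for a table row)."""
    row = {"id": test_id, "data": data, "prediction": prediction}
    if output_neighbors:
        # Copy so later changes to neighbor_ids do not alter the row.
        row["k_nearest_neighbors"] = list(neighbor_ids)
    return row

default_row = build_output_row(1, [2, 1], 1, [2, 8, 6, 3])
full_row = build_output_row(1, [2, 1], 1, [2, 8, 6, 3], output_neighbors=True)
print(default_row)
print(full_row)
```

With the default of FALSE the output keeps its current three columns, so 
existing callers would see no change.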

Also note that I renamed the parameter id_column_name to test_id for clarity.

I would ask others to comment on this proposed interface change, for example 
[~riyer], [~njayaram] and [~okislal].








> Additional output information for k-NN
> --------------------------------------
>
>                 Key: MADLIB-1129
>                 URL: https://issues.apache.org/jira/browse/MADLIB-1129
>             Project: Apache MADlib
>          Issue Type: Improvement
>          Components: k-NN
>            Reporter: Frank McQuillan
>            Priority: Minor
>              Labels: starter
>             Fix For: v2.0
>
>
> Follow on to
> https://issues.apache.org/jira/browse/MADLIB-927
> List the k-nearest neighbors that were used in the voting/averaging, sorted 
> in ASC order according to the distance function used.  This could be added to 
> the current output table.



