[
https://issues.apache.org/jira/browse/MADLIB-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139202#comment-16139202
]
Frank McQuillan edited comment on MADLIB-1129 at 8/23/17 9:52 PM:
------------------------------------------------------------------
I would suggest that the list of k-nearest neighbors that were used in the
voting/averaging should be an array of INTEGERs that are the ids of the
point_source table.
Current output:
{code}
 id |  data   | prediction
----+---------+------------
  1 | {2,1}   |          1
  2 | {2,6}   |          1
  3 | {15,40} |          0
  4 | {12,1}  |          1
  5 | {2,90}  |          0
  6 | {50,45} |          0
(6 rows)
{code}
New proposed output for k=4 (with dummy data):
{code}
 id |  data   | prediction | k_nearest_neighbors
----+---------+------------+---------------------
  1 | {2,1}   |          1 | {2,8,6,3}
  2 | {2,6}   |          1 | {2,1,3,4}
  3 | {15,40} |          0 | {1,2,6,6}
  4 | {12,1}  |          1 | {2,8,6,3}
  5 | {2,90}  |          0 | {7,5,4,3}
  6 | {50,45} |          0 | {2,8,6,3}
(6 rows)
{code}
where the ids in k_nearest_neighbors are sorted in ascending order of
distance, from closest to furthest, according to the distance measure used.
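To make the intended ordering concrete, here is a minimal Python sketch (not
MADlib code). It assumes Euclidean distance and breaks ties by the smaller id;
both choices, the function name k_nearest_ids and the sample points are
hypothetical and only for illustration.

```python
import math

def k_nearest_ids(points, query, k):
    """Return the ids of the k points closest to query, nearest first.

    points maps an integer id to a coordinate tuple. Ids must be unique
    but do not need to be contiguous.
    """
    # Assumption: Euclidean distance, ties broken by the smaller id.
    ranked = sorted(
        points.items(),
        key=lambda item: (math.dist(item[1], query), item[0]),
    )
    return [point_id for point_id, _ in ranked[:k]]

# Hypothetical source points keyed by non-contiguous ids.
points = {1: (1, 1), 2: (2, 2), 3: (10, 10), 7: (2, 3), 8: (50, 50)}
print(k_nearest_ids(points, (2, 1), 4))  # prints [1, 2, 7, 3]
```

The returned list plays the role of one k_nearest_neighbors array in the
proposed output.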
This means the interface needs to change.
The current interface is:
{code}
knn( point_source,
     point_column_name,
     label_column_name,
     test_source,
     test_column_name,
     id_column_name,
     output_table,
     operation,
     k
   )
{code}
The new proposed interface is:
{code}
knn( point_source,
     point_column_name,
     point_id,
     label_column_name,
     test_source,
     test_column_name,
     test_id,
     output_table,
     operation,
     k,
     output_neighbors
   )
{code}
{code}
point_id
TEXT, default = 'id'. Name of the column in 'point_source' containing source
data ids. The ids are of type INTEGER with no duplicates. They do not need to
be contiguous.
{code}
{code}
output_neighbors (optional)
BOOLEAN, default = FALSE. If TRUE, outputs the list of k-nearest neighbors
that were used in the voting/averaging.
{code}
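To show how the two new parameters would affect the result, here is a rough
Python sketch of the proposed behavior (not the MADlib implementation). The
function knn_sketch, the in-memory row tuples and the sample data are
hypothetical. It also assumes Euclidean distance and that operation takes 'c'
for classification (majority vote) or 'r' for regression (average).

```python
import math
from collections import Counter

def knn_sketch(point_rows, test_rows, operation, k, output_neighbors=False):
    """point_rows: (point_id, point, label); test_rows: (test_id, point)."""
    # Assumption: 'c' = classification (vote), 'r' = regression (average).
    if operation not in ("c", "r"):
        raise ValueError("operation must be 'c' or 'r'")
    results = []
    for test_id, data in test_rows:
        # Rank source points by distance to this test point, nearest first.
        ranked = sorted(
            point_rows,
            key=lambda row: (math.dist(row[1], data), row[0]),
        )
        nearest = ranked[:k]
        labels = [label for _, _, label in nearest]
        if operation == "c":
            prediction = Counter(labels).most_common(1)[0][0]
        else:
            prediction = sum(labels) / len(labels)
        row = {"id": test_id, "data": data, "prediction": prediction}
        if output_neighbors:
            # Ids come from the point_id column, closest first.
            row["k_nearest_neighbors"] = [pid for pid, _, _ in nearest]
        results.append(row)
    return results

# Hypothetical data: point ids are unique but not contiguous.
point_rows = [
    (1, (1, 1), 1),
    (2, (2, 2), 1),
    (3, (10, 10), 0),
    (7, (2, 3), 1),
    (8, (50, 50), 0),
]
test_rows = [(1, (2, 1)), (6, (50, 45))]

for row in knn_sketch(point_rows, test_rows, "c", 3, output_neighbors=True):
    print(row)
```

With output_neighbors left at FALSE the rows have the same three columns as
the current output, so existing callers would see no change in the result.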
Also note that I renamed the parameter id_column_name to test_id for clarity.
I would like others to comment on this proposed interface change, for example
[~riyer], [~njayaram] and [~okislal].
> Additional output information for k-NN
> --------------------------------------
>
> Key: MADLIB-1129
> URL: https://issues.apache.org/jira/browse/MADLIB-1129
> Project: Apache MADlib
> Issue Type: Improvement
> Components: k-NN
> Reporter: Frank McQuillan
> Priority: Minor
> Labels: starter
> Fix For: v2.0
>
>
> Follow on to
> https://issues.apache.org/jira/browse/MADLIB-927
> List the k-nearest neighbors that were used in the voting/averaging, sorted
> in ASC order according to the distance function used. This could be added to
> the current output table.