kaknikhil commented on a change in pull request #441: Kmeans: simplified
silhouette per point for k-means
URL: https://github.com/apache/madlib/pull/441#discussion_r325328731
##########
File path: src/ports/postgres/modules/kmeans/test/kmeans.sql_in
##########
@@ -81,11 +81,37 @@ COPY km_sample (pid, points) FROM stdin DELIMITER '|';
10 | {13.86, 1.35, 2.27, 16, 98, 2.98, 3.15, 0.22, 1.8500, 7.2199, 1.01, NULL,
1045}
\.
+DROP TABLE IF EXISTS centroids_null, silh_out;
+CREATE TABLE centroids_null AS
SELECT * FROM kmeanspp('km_sample', 'points', 2,
'MADLIB_SCHEMA.squared_dist_norm2',
'MADLIB_SCHEMA.avg', 20, 0.001);
+SELECT simple_silhouette_points('km_sample', 'silh_out', 'pid', 'points',
+ 'centroids_null', 'centroids',
+ 'MADLIB_SCHEMA.squared_dist_norm2');
+
+SELECT assert(silh > 0, 'Incorrect silhouette value')
+FROM silh_out
+WHERE silh IS NOT NULL;
+
+DROP TABLE IF EXISTS silh_out;
+SELECT simple_silhouette_points('km_sample', 'silh_out', 'pid', 'points',
+ 'centroids_null', 'centroids');
+
+SELECT assert(count(*) = 9, 'Incorrect silhouette count')
Review comment:
1. Since the silhouette calculation is deterministic, if we create our own
   centroids table, can't we assert the actual values for all 9 points?
   * If we do this, we might also be able to add a test for the case
     where distances[2] == 0. What do you think?
1. We can add a few more asserts here:
   * `pid` values should be the same as in the source table
   * `centroid_id` and `neighbor_centroid_id` should have different values
     for each pid
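
A rough sketch of what the two extra asserts could look like, reusing the `assert` helper and the `silh_out` / `km_sample` tables from the test above. It assumes that `silh_out` exposes `pid`, `centroid_id` and `neighbor_centroid_id` columns, and that points containing NULL are skipped, so only one direction of the `pid` check is enforced:

```sql
-- Assumption: every pid written to silh_out must come from km_sample.
-- Rows of km_sample whose points contain NULL are assumed to be skipped,
-- so the reverse direction is not checked here.
SELECT assert(count(*) = 0, 'silh_out contains a pid missing from km_sample')
FROM silh_out s
LEFT JOIN km_sample k ON s.pid = k.pid
WHERE k.pid IS NULL;

-- A point's own centroid and its nearest neighboring centroid must differ.
SELECT assert(count(*) = 0, 'centroid_id equals neighbor_centroid_id')
FROM silh_out
WHERE centroid_id = neighbor_centroid_id;
```

Both queries aggregate with `count(*)`, so they still return one row (and therefore still run the assert) when no offending rows exist.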