fmcquillan99 commented on issue #441: Kmeans: simplified silhouette per point for k-means
URL: https://github.com/apache/madlib/pull/441#issuecomment-532483424

This looks really nice now, here are a few related tests I ran on k-means auto and this PR

(0) silh by points
```
SELECT * FROM madlib.simple_silhouette_points(
    'km_sample',                 -- Input points table
    'km_points_silh',            -- Output table
    'pid',                       -- Point ID column in input table
    'points',                    -- Points column in input table
    (SELECT centroids FROM km_result_auto WHERE k=3), -- centroids array
    'madlib.squared_dist_norm2'  -- Distance function
);

SELECT * FROM km_points_silh ORDER BY pid;
 pid | centroid_id | neighbor_centroid_id |       silh
-----+-------------+----------------------+-------------------
   1 |           2 |                    0 | 0.800019825058391
   2 |           2 |                    0 | 0.786712987562913
   3 |           0 |                    2 | 0.867496080386644
   4 |           1 |                    0 | 0.995466498952947
   5 |           2 |                    0 | 0.761551610381542
   6 |           1 |                    0 | 0.993950286967157
   7 |           0 |                    1 | 0.960621637528528
   8 |           0 |                    1 | 0.941493577405087
   9 |           2 |                    0 | 0.925822020308802
  10 |           2 |                    0 |  0.92536421766532
(10 rows)
```

(1) duplicate k values
```
DROP TABLE IF EXISTS km_result_auto1, km_result_auto_summary1;

SELECT madlib.kmeans_random_auto(
    'km_sample',                 -- points table
    'km_result_auto1',           -- output table
    'points',                    -- column name in point table
    ARRAY[2,2,4,5,6],            -- k values to try
    'madlib.squared_dist_norm2', -- distance function
    'madlib.avg',                -- aggregate function
    20,                          -- max iterations
    0.001,                       -- minimum fraction of centroids reassigned to continue iterating
    'both'                       -- k selection algorithm (simple silhouette and elbow)
);

SELECT * FROM km_result_auto_summary1;

ERROR:  plpy.Error: kmeans_auto: Duplicate values are not allowed in k. (plpython.c:5038)
CONTEXT:  Traceback (most recent call last):
  PL/Python function "kmeans_random_auto", line 21, in <module>
    return kmeans_auto.kmeans_random_auto(**globals())
  PL/Python function "kmeans_random_auto", line 209, in kmeans_random_auto
  PL/Python function "kmeans_random_auto", line 77, in kmeans_auto
  PL/Python function "kmeans_random_auto", line 49, in _validate
  PL/Python function "kmeans_random_auto", line 96, in _assert
PL/Python function "kmeans_random_auto"
```

(2) zero distance on closest columns, random test
```
SELECT cent.*, (madlib.closest_column(centroids, unnest_result, 'madlib.squared_dist_norm2')).*
FROM km_centroids_unnest as cent, km_result
ORDER BY cent.unnest_row_id;
 unnest_row_id | unnest_result                                                                                                                                              | column_id | distance
---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------
             1 | {14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75}                                                                   |         0 |        0
             2 | {13.7533333333333,1.905,2.425,16.0666666666667,90.3333333333333,2.805,2.98,0.29,2.005,5.40663333333333,1.04166666666667,3.31833333333333,1020.83333333333} |         1 |        0
(2 rows)
```

LGTM
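
For anyone reading the columns in (0), here is a minimal Python sketch of how a per-point simplified silhouette could be computed by hand. It assumes the usual simplified-silhouette definition, `(b - a) / max(a, b)`, where `a` is the distance from the point to its closest centroid and `b` is the distance to the second-closest one. This is not MADlib's implementation. The names `squared_dist_norm2` (a plain-Python stand-in for the distance function) and `simple_silhouette_point` are hypothetical helpers for illustration only.

```python
# Sketch only, not MADlib code: simplified silhouette for a single point.

def squared_dist_norm2(x, y):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def simple_silhouette_point(point, centroids, dist=squared_dist_norm2):
    """Return (centroid_id, neighbor_centroid_id, silh) for one point.

    centroid_id is the index of the closest centroid and
    neighbor_centroid_id the index of the second-closest one.
    """
    if len(centroids) < 2:
        raise ValueError("need at least two centroids")

    # Distance from the point to every centroid, then rank the indices.
    dists = [dist(point, c) for c in centroids]
    order = sorted(range(len(centroids)), key=lambda i: dists[i])
    closest, neighbor = order[0], order[1]

    a, b = dists[closest], dists[neighbor]
    denom = max(a, b)
    # Assumption: report 0 when both distances are 0 (coincident centroids).
    silh = (b - a) / denom if denom > 0 else 0.0
    return closest, neighbor, silh


if __name__ == "__main__":
    # Toy 2-D data, unrelated to km_sample.
    toy_centroids = [(0.0, 0.0), (10.0, 0.0)]
    print(simple_silhouette_point((1.0, 0.0), toy_centroids))
```

Under this definition a point sitting exactly on its centroid scores 1, and a point equidistant from its two nearest centroids scores 0, which matches the high values in (0) for points close to their assigned centroid.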
