fmcquillan99 edited a comment on issue #441: Kmeans: simplified silhouette per point for k-means
URL: https://github.com/apache/madlib/pull/441#issuecomment-532483424
 
 
   This looks really nice now. Here are a few related tests I ran on k-means auto and this PR:
   
   
   (0)
   silh by points with WHERE clause
   ```
   SELECT * FROM madlib.simple_silhouette_points( 'km_sample',          -- Input points table
                                                 'km_points_silh',     -- Output table
                                                 'pid',                -- Point ID column in input table
                                                 'points',             -- Points column in input table
                                                 (SELECT centroids FROM km_result_auto WHERE k=3), -- centroids array
                                                 'madlib.squared_dist_norm2'   -- Distance function
                                         );
   SELECT * FROM km_points_silh ORDER BY pid;
   
   pid | centroid_id | neighbor_centroid_id |       silh
   -----+-------------+----------------------+-------------------
      1 |           2 |                    0 | 0.800019825058391
      2 |           2 |                    0 | 0.786712987562913
      3 |           0 |                    2 | 0.867496080386644
      4 |           1 |                    0 | 0.995466498952947
      5 |           2 |                    0 | 0.761551610381542
      6 |           1 |                    0 | 0.993950286967157
      7 |           0 |                    1 | 0.960621637528528
      8 |           0 |                    1 | 0.941493577405087
      9 |           2 |                    0 | 0.925822020308802
     10 |           2 |                    0 |  0.92536421766532
   (10 rows)
   ```
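
   For readers checking the `silh` column by hand, here is a minimal sketch of a simplified silhouette per point. It assumes the usual definition `(b - a) / max(a, b)`, where `a` is the distance to the closest centroid and `b` is the distance to the next closest one. This is plain Python for illustration, not the code in this PR, and the function names are hypothetical.

```python
def squared_dist_norm2(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def simple_silhouette_point(point, centroids, dist=squared_dist_norm2):
    """Return (centroid_id, neighbor_centroid_id, silh) for one point.

    Assumed definition: silh = (b - a) / max(a, b), with a the distance
    to the closest centroid and b the distance to the second closest.
    """
    if len(centroids) < 2:
        raise ValueError("need at least two centroids")
    # Rank centroids by distance; ties are broken by centroid index.
    ranked = sorted((dist(point, c), i) for i, c in enumerate(centroids))
    (a, closest), (b, neighbor) = ranked[0], ranked[1]
    # Guard against 0/0 when the point coincides with both centroids.
    silh = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return closest, neighbor, silh


# Toy data (not from km_sample): two centroids on the x-axis.
toy_centroids = [[0.0, 0.0], [10.0, 0.0]]
near_first = simple_silhouette_point([1.0, 0.0], toy_centroids)
on_boundary = simple_silhouette_point([5.0, 0.0], toy_centroids)
```

   With the squared distance, the point at `[1.0, 0.0]` has `a = 1` and `b = 81`, so its score is `80/81`, while the point midway between the centroids scores 0.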
   
   (1)
   duplicate k values
   ```
   DROP TABLE IF EXISTS km_result_auto1, km_result_auto_summary1;
   SELECT madlib.kmeans_random_auto(
       'km_sample',                   -- points table
       'km_result_auto1',              -- output table
       'points',                      -- column name in point table
       ARRAY[2,2,4,5,6],              -- k values to try
       'madlib.squared_dist_norm2',   -- distance function
       'madlib.avg',                  -- aggregate function
       20,                            -- max iterations
       0.001,                         -- minimum fraction of centroids reassigned to continue iterating
       'both'                         -- k selection algorithm  (simple silhouette and elbow)
   );
   SELECT * FROM km_result_auto_summary1;
   
   ERROR:  plpy.Error: kmeans_auto: Duplicate values are not allowed in k. (plpython.c:5038)
   CONTEXT:  Traceback (most recent call last):
     PL/Python function "kmeans_random_auto", line 21, in <module>
       return kmeans_auto.kmeans_random_auto(**globals())
     PL/Python function "kmeans_random_auto", line 209, in kmeans_random_auto
     PL/Python function "kmeans_random_auto", line 77, in kmeans_auto
     PL/Python function "kmeans_random_auto", line 49, in _validate
     PL/Python function "kmeans_random_auto", line 96, in _assert
   PL/Python function "kmeans_random_auto"
   ```
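
   The check exercised here can be sketched as follows. This is a hypothetical stand-alone helper, not the `_validate` code from the traceback; only the error text is taken from the output above.

```python
def validate_k_values(k_values):
    """Reject an empty or duplicate-containing list of k values.

    Hypothetical helper for illustration; the message for duplicates
    mirrors the error shown in the test output.
    """
    if not k_values:
        raise ValueError("k must contain at least one value")
    # A set drops repeats, so a length mismatch means duplicates exist.
    if len(set(k_values)) != len(k_values):
        raise ValueError("Duplicate values are not allowed in k.")
    return list(k_values)


try:
    validate_k_values([2, 2, 4, 5, 6])
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)

accepted = validate_k_values([2, 4, 5, 6])
```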
   
   (2)
   zero distance on closest columns, random test
   ```
   SELECT cent.*,  (madlib.closest_column(centroids, unnest_result, 'madlib.squared_dist_norm2')).*
   FROM km_centroids_unnest as cent, km_result
   ORDER BY cent.unnest_row_id;
   
    unnest_row_id |                                                                       unnest_result                                                                        | column_id | distance
   ---------------+------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------
                1 | {14.255,1.9325,2.5025,16.05,110.5,3.055,2.9775,0.2975,1.845,6.2125,0.9975,3.365,1378.75}                                                                   |         0 |        0
                2 | {13.7533333333333,1.905,2.425,16.0666666666667,90.3333333333333,2.805,2.98,0.29,2.005,5.40663333333333,1.04166666666667,3.31833333333333,1020.83333333333} |         1 |        0
   (2 rows)
   ```
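
   The expectation behind this test is that each centroid, fed back through the closest-column lookup against the full centroid set, matches its own index with distance 0. A minimal Python sketch of that property follows, using toy centroids rather than the values above. The `closest_column` here is a hypothetical stand-in that returns `(column_id, distance)`, not MADlib's implementation.

```python
def squared_dist_norm2(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


def closest_column(columns, vector, dist=squared_dist_norm2):
    """Return (column_id, distance) of the column nearest to vector."""
    distances = [dist(c, vector) for c in columns]
    column_id = min(range(len(columns)), key=distances.__getitem__)
    return column_id, distances[column_id]


# Toy centroids (assumed values, not taken from km_result).
toy_centroids = [[1.0, 2.0], [4.0, 6.0]]

# Each centroid should map to itself at zero distance.
self_matches = [closest_column(toy_centroids, c) for c in toy_centroids]
```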
   
   (3)
   silh by points, no WHERE clause
   ```
   SELECT * FROM madlib.simple_silhouette_points( 'km_sample',          -- Input points table
                                                 'km_points_silh',      -- Output table
                                                 'pid',                 -- Point ID column in input table
                                                 'points',              -- Points column in input table
                                                 'km_result',           -- Centroids table
                                                 'centroids',           -- Column in centroids table containing centroids
                                                 'madlib.squared_dist_norm2'   -- Distance function
                                         );
   
   pid | centroid_id | neighbor_centroid_id |       silh
   -----+-------------+----------------------+-------------------
      1 |           1 |                    0 | 0.966608819821713
      2 |           1 |                    0 | 0.926251125077039
      3 |           1 |                    0 |  0.28073008848306
      4 |           0 |                    1 | 0.951447083910869
      5 |           1 |                    0 |  0.80098239014753
      6 |           0 |                    1 | 0.972487557020722
      7 |           0 |                    1 |  0.88838568723116
      8 |           0 |                    1 | 0.906334719972002
      9 |           1 |                    0 | 0.994315140692314
     10 |           1 |                    0 |  0.99420347703982
   (10 rows)
   ```
   
   LGTM
