* Veronica Andreo <[email protected]> [2018-10-31 00:23:57 +0100]:
Hi devs,
Hi Vero, (not a real dev, but I'll share what I think)
I'm writing to ask: how does one determine the best number of classes/clusters in a set of unsupervised classifications with different k in GRASS?
You already know this better than I do, I guess, but I'd like to refresh my mind on all this a bit. I suppose the only way to tell whether a number of classes is "best" is to judge for yourself by inspecting the "quality" of the clusters returned. One way would be to compute the "error" of the clustering, i.e. the overall distance between the points assigned to each cluster and its center. Comparing the overall errors between different clustering settings (or even algorithms?) would give an idea of how tightly the points sit around the cluster centers. Maybe we could implement something like this; a sketch follows below. (All this I practiced during a generic Algorithmic Thinking course. I guess it's applicable in our "domain" too.)
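Roughly like this, in plain NumPy (a minimal sketch; the variable names and the random test data are stand-ins, you would feed in your own pixel values plus the centers and labels that a clustering run produces):

import numpy as np

def clustering_sse(points, labels, centers):
    """Sum of squared distances from each point to its assigned center."""
    diffs = points - centers[labels]   # per-point offset from its own center
    return float(np.sum(diffs ** 2))

# Toy usage: compare the overall error across runs with different k.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 4))     # e.g. 500 pixels, 4 bands
for k in (3, 5, 8):
    centers = rng.normal(size=(k, 4))  # stand-in for reported cluster means
    labels = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    print(k, clustering_sse(points, labels, centers))

One caveat: this error shrinks as k grows no matter what, so rather than taking the minimum one usually looks for the "elbow" where adding more clusters stops paying off.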
I use i.cluster with different numbers of classes and then i.maxlik, which uses a modified version of k-means according to the manual page. Now, I would like to know which unsup classif is the best within the set.
Sorry, I guess I have to read up: what is "unsup classif"?
I checked the i.cluster reports (looking for separability) and then explored the rejection maps, but none of those seems to work as a crisp and clear indicator. BTW, does anyone know which separability index i.cluster uses?
I am interested in learning about the distance measure too, so I am looking at the source code of `i.cluster`. Searching around, I think it's this file: grasstrunk/lib/cluster/c_sep.c, and I/we just need to identify which distance it measures; a reference sketch of one common candidate follows below. Nikos
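To be clear, I have not confirmed what c_sep.c actually computes. But as a reference point while reading the code, here is one separability index that is common in remote sensing, the transformed divergence between two classes assumed Gaussian (the means and covariances below are hypothetical placeholders for whatever i.cluster reports per class):

import numpy as np

def transformed_divergence(m_i, c_i, m_j, c_j):
    """Transformed divergence of two Gaussian classes, scaled to [0, 2000]."""
    ci_inv, cj_inv = np.linalg.inv(c_i), np.linalg.inv(c_j)
    dm = (m_i - m_j).reshape(-1, 1)
    # Divergence: a covariance-shape term plus a mean-separation term.
    d = 0.5 * np.trace((c_i - c_j) @ (cj_inv - ci_inv)) \
        + 0.5 * np.trace((ci_inv + cj_inv) @ (dm @ dm.T))
    return 2000.0 * (1.0 - np.exp(-d / 8.0))

Values near 2000 mean two classes are essentially separable; values near 0 mean they overlap heavily. Again, whether this matches c_sep.c is exactly what needs checking.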
In any case, I have seen some indices elsewhere (mainly in R and Python) that are used to choose the best clustering result (coming from the same or from different clustering methods). Examples of such indices are Silhouette, Dunn, etc. Some are called internal, as they do not require test data and just characterize the compactness of the clusters; the ones requiring test data, on the other hand, are called external. I have seen them in the dtwclust R package [0] (the package is oriented to time series clustering, but the validation indices are more general) and in scikit-learn in Python [1].

Do any of you have something already implemented in this direction? Or how do you assess your unsup classification (clustering) results? Any ideas or suggestions within GRASS?

Thanks much in advance!
Vero

[0] https://rdrr.io/cran/dtwclust/man/cvi.html
[1] http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation
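Not that I know of anything ready-made in GRASS, but since you already cite scikit-learn in [1], an internal index such as the silhouette coefficient takes only a few lines there. A sketch (the random array stands in for your pixels-by-bands data, which you would first read out of GRASS, e.g. via grass.script; that part is not shown here):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))   # stand-in for n pixels x 4 bands

for k in (3, 5, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # silhouette_score needs only the data and the labels: an "internal" index.
    print(k, silhouette_score(X, labels))

Silhouette ranges from -1 to 1 and higher is better, so taking the k with the maximum score is exactly the kind of crisp indicator you are after.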
--
Nikos Alexandris | Remote Sensing & Geomatics
GPG Key Fingerprint: 6F9D4506F3CA28380974D31A9053534B693C4FB3
_______________________________________________
grass-dev mailing list
[email protected]
https://lists.osgeo.org/mailman/listinfo/grass-dev
