I asked Elias Krainski (the author of skater()), who replied as copied inline below:

On Tue, 13 Sep 2016, Michael O'Donnell wrote:

Hi,

I am interested in calculating multiple statistics based on skater{spdep} results for a SpatialPointsDataFrame, and I was wondering if someone could help me verify that what I have done is correct (Q1).

My objective is to evaluate the performance of the clustering while using different parameters for different skater() runs. Specifically, I am not sure how to measure the within-group similarity and I believe the other statistics are defined correctly.

Also, can someone provide more details on the objects "not.prune" and "candidates" (Q2)?

not.prune is the set of edges that, if pruned, would generate groups that do not satisfy the restriction. For example, when you want groups with at least 10 areas, at some point a group stops being considered for pruning because of this.



Q1 ------------------------------ These are the statistics that I would like to calculate:
res1 <- skater() # Example of skater object

# The sum of the between-group dissimilarity
sst <- res1$ssto

# The within-group similarity
sse <- sum(res1$ssw)/max(res1$groups)

SSW is the sum of homogeneity at each step of the SKATER algorithm. So the first number coincides with SSTO, the second is for the case of two groups, the third for the case of three groups, and so on. That is, it has length equal to the number of clusters. However, res1$groups identifies which group each area belongs to and has length equal to the number of areas. So it doesn't make sense to divide sum(res1$ssw) by the number of groups. You may want res1$ssw/1:length(res1$ssw)
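To make the difference between the two objects concrete, here is a minimal sketch in base R. It uses a hand-made list with the same element names as a skater result (groups, ssto, ssw); the numbers are invented purely for illustration and do not come from any real skater() run.

```r
# Hypothetical stand-in for a skater() result: 8 areas, 3 groups.
# All values are made up for illustration only.
res1 <- list(
  groups = c(1, 1, 2, 2, 3, 3, 3, 1),  # one group label per area
  ssto   = 100,                        # total sum of squares
  ssw    = c(100, 60, 45)              # one value per step: 1, 2, 3 groups
)

length(res1$groups)  # number of areas (8)
length(res1$ssw)     # number of clusters (3)

# Sum of squares at each step divided by the number of groups at that step,
# equivalent to res1$ssw/1:length(res1$ssw)
avg_ssw <- res1$ssw / seq_along(res1$ssw)
avg_ssw  # 100 30 15
```

Note that the first element of ssw equals ssto, as described above, so the first ratio is just the total sum of squares.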



# R2
R2 <- (sst-sse)/sst

Would it make sense to compute some kind of gain from having groups? The gain can be the difference between consecutive partitions, like diff(res1$ssw)
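A minimal base R sketch of that gain, using invented ssto and ssw values rather than output from a real skater() run. The per-step R2 at the end is an assumption about what the original R2 was meant to express (share of SSTO removed by each partition), not something confirmed in this thread.

```r
# Hypothetical values for illustration only
ssto <- 100
ssw  <- c(100, 60, 45)  # 1 group, 2 groups, 3 groups

# diff() gives the change between consecutive partitions; it is negative
# when the within-group sum of squares drops, so negate it to get a gain.
gain <- -diff(ssw)
gain  # 40 15

# Assumed interpretation: proportion of SSTO explained at each step
r2_by_step <- (ssto - ssw) / ssto
r2_by_step  # 0.00 0.40 0.55
```

Under this reading, the R2 for the final partition would use only the last element of ssw, not the sum over all steps.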


# AIC,AICc
# AIC = n*log(SSD/n) + 2*cov_count
# AICc = AIC + 2*cov_count*(cov_count+1)/(n-cov_count-1)
# Number of covariates considered by skater and provided in data
cov_count <- 1
n_count <- nrow(shape2) # Node count
aic <- n_count * log(sst / n_count) + 2.0 * cov_count
aicc <- aic + 2.0 * cov_count * (cov_count + 1.0) / (n_count - cov_count - 1.0)

I'm not sure about this anymore...


# Calinski-Harabasz pseudo F-statistic
nc <- max(res1$groups)
n <- nrow(shape2)
fstat = (R2 / (nc - 1)) / ((1 - R2) / (n - nc))

It will be useful to consider the function index.G1 from the clusterSim package.
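For comparison, the pseudo F-statistic can also be computed directly from the data and the group labels. The sketch below is a hand-rolled base R version on a tiny made-up data set; the helper name calinski_harabasz is hypothetical, and this is not the clusterSim implementation, only a way to check the R2-based formula above.

```r
# Hypothetical helper: Calinski-Harabasz pseudo F from data and labels.
# x:  numeric matrix or data frame (rows = areas, columns = variables)
# cl: vector of group labels, one per row of x
calinski_harabasz <- function(x, cl) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(cl))
  # Total sum of squares around the overall column means
  tss <- sum(scale(x, center = TRUE, scale = FALSE)^2)
  # Within-group sum of squares around each group's column means
  wss <- sum(sapply(split(seq_len(n), cl), function(idx) {
    sum(scale(x[idx, , drop = FALSE], center = TRUE, scale = FALSE)^2)
  }))
  bss <- tss - wss
  (bss / (k - 1)) / (wss / (n - k))
}

# Made-up example: two well-separated groups of three observations
x  <- matrix(c(1, 2, 3, 11, 12, 13), ncol = 1)
cl <- c(1, 1, 1, 2, 2, 2)
fstat_direct <- calinski_harabasz(x, cl)
fstat_direct  # 150
```

With R2 defined as bss/tss, this is algebraically the same quantity as (R2 / (nc - 1)) / ((1 - R2) / (n - nc)).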


# Review
print(c(aic, aicc, fstat, R2))

Q2 ------------------------------
Define "not.prune" and "candidates"

For example, are candidates a list of cluster groups that are statistically significant, while not.prune is a list of nodes that did not get assigned to a group? I have not been able to locate enough documentation on these objects and I am not sure how to interpret them.

No. We haven't considered any kind of statistical test. As I mentioned above, the not.prune edges are those that don't match the criteria (about the size of the cluster).

Elias


Thank you for your assistance,
Mike



--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 91 00
e-mail: roger.biv...@nhh.no
http://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
http://depsy.org/person/434412

