On Thu, 2013-05-30 at 20:36 +0000, Hall, Kyle wrote:
> First time poster, please forgive me for errors.
>
> I have a data set of 23 sites with 145 different species counts for
> macroinvertebrate communities for a given year. each species is
> represented at least once per site and there are a lot of 0's for some
> species. I have been applying a variety of vegan functions to the data
> set to get a better understanding of the structure and I would like to
> classify sites based on species using randomForest. My thought is that
> this will give me a more understandable classification based on
> species that I can use to cluster my sites and also see which species
> are of more importance in classification.
>
> Question 1. Am I barking up the wrong...tree (pun intended) with
> randomForest for this purpose?

Yes - I doubt unsupervised RF would give you anything more than you
could get from a suitably-chosen dissimilarity matrix or even
ordination to check if you actually have clusters. For supervised RF,
even if you had a classification, you have far too few sites/samples to
warrant a machine learning tool.

> With PCA two sites are typically separated from the rest but the other
> 21 sites show no discernible structure; spread like white noise over
> both axes.
> When I perform NMDS I tend to get a shot gun look and there are not
> tight groupings on these reduced axes. However with Wards clustering
> (in hclust) I do see some clusters plotting heavily to one side of an
> axis or another (albeit spread wide on the orthogonal axis).

Ward's clustering, as with any clustering, will find clusters - it
[Ward's method] tends to find compact, spherical ones IIRC and hence
often looks convincing. Your job is to demonstrate that the clustering
into k clusters explains more of the variance in the model than no
clustering. Simply eye-balling the dendrogram is not a solution to
this.

> Question 2. Is it possible that my data set just doesn't have enough
> structure to neatly classify Sites by species count or am I simply a
> newbie that is applying randomForest incorrectly?

With so few data I wouldn't bother with machine learning tools - they
are designed to work with hundreds, thousands, or more samples.

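A minimal sketch of one way to check whether a k-cluster Ward solution
explains more variance than the same clustering applied to structureless
data, using base R only. Everything here is an assumption for
illustration: `comm` is simulated stand-in data (not the poster's
counts), `explained_ss` and `null_ss` are hypothetical helper names, the
square-root transform and k = 3 are arbitrary choices, and the
column-permutation null model is just one option, not a method
prescribed in this thread.

```r
# Simulated stand-in for a site-by-species count matrix (23 sites, 20 species).
set.seed(42)
n_sites <- 23
n_spp   <- 20
comm <- matrix(rpois(n_sites * n_spp, lambda = 2), nrow = n_sites,
               dimnames = list(paste0("S", seq_len(n_sites)),
                               paste0("sp", seq_len(n_spp))))

# Proportion of the total sum of squares explained by cutting a Ward
# dendrogram into k clusters (hypothetical helper).
explained_ss <- function(x, k) {
  x   <- sqrt(x)                                  # mild transform for counts (assumption)
  cl  <- cutree(hclust(dist(x), method = "ward.D2"), k = k)
  tot <- sum(scale(x, scale = FALSE)^2)           # SS about the grand centroid
  wss <- sum(vapply(split(as.data.frame(x), cl), function(g) {
    sum(scale(as.matrix(g), scale = FALSE)^2)     # SS about each cluster centroid
  }, numeric(1)))
  1 - wss / tot
}

# Null distribution: permute each species column independently, which keeps
# every species' counts but breaks any association among species.
null_ss <- function(x, k, nperm = 199) {
  replicate(nperm, explained_ss(apply(x, 2, sample), k))
}

obs  <- explained_ss(comm, k = 3)
null <- null_ss(comm, k = 3)

# One-sided permutation p-value: how often does structureless data
# cluster at least as "well" as the observed data?
p <- (sum(null >= obs) + 1) / (length(null) + 1)
print(c(observed = obs, null_mean = mean(null), p_value = p))
```

If `p` is large, the k-cluster solution explains no more variance than
Ward's method extracts from data with no real group structure, which is
the situation an eye-balled dendrogram cannot reveal.
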
HTH

G

> Example of data structure:
> Site     ABLA.MAL  ABLA.PAR  ACEN.SPP  ACRO.MEL
> MC14A           1         1         2         0
> MC17            4         2         0         0
> MC22A           8         0         0         0
> MC25           13         3         0         0
> MC27            0         0         0         0
> MC29A1          1         0         0         0
> MC30A           1         0         0         0
> MC31A           4         1         0         0
> MC31B           4         0         0         0
> MC33            8         0         0         0
> MC38            7         0         0         0
> MC40A          12         3         0         0
> MC42            0         0         0         0
> MC45            9         0         0         0
> MC47A           0         0         0         0
> MC49A           5         0         0         0
> MC50            2         0         0         1
> MC51           13         0         0         0
> MC66            4         0         0         0
> MY11B          13         1         0         0
> MY13            0         0         0         0
> MY7B            1         0         0         0
> MY8             3         2         0         1
>
>
> This is my call to randomForest:
>
> FY09BUGS.rF <- randomForest(Site~ .,data=FY09Bugs, ntree=500,
> mtry=sqrt(ncol(FY09Bugs)), replace=TRUE,importance=TRUE, proximity=TRUE,
> norm.votes=TRUE, keep.forest=TRUE, do.trace=100)
>
> I am following the iris data example with my formula but the print data on
> FY09BUGS.rf returns 100% OOB error rate and the summary returns:
>
> summary(FY09BUGS.rF)
>                 Length Class  Mode
> call               11  -none- call
> type                1  -none- character
> predicted          23  factor numeric
> err.rate        12000  -none- numeric
> confusion         552  -none- numeric
> votes             529  matrix numeric
> oob.times          23  -none- numeric
> classes            23  -none- character
> importance       3625  -none- numeric
> importanceSD     3480  -none- numeric
> localImportance     0  -none- NULL
> proximity         529  -none- numeric
> ntree               1  -none- numeric
> mtry                1  -none- numeric
> forest             14  -none- list
> y                  23  factor numeric
> test                0  -none- NULL
> inbag               0  -none- NULL
> terms               3  terms  call
>
> One concern I have is that the iris example does not appear to give a
> training data set and so I don't believe I have done that either. I feel like
> there is potential here but I can't seem to find the solution searching
> online so I put the questions to you! Thanks in advance for any assistance or
> constructive criticism.
>
> Kyle
>
>
> Kyle Hall
> City of Charlotte Storm Water Services
> Water Quality Modeler
> 600 East Fourth Street
> Charlotte, NC 28202
> 704.336.4110
> Fax: 704.353.0473
>
> _______________________________________________
> R-sig-ecology mailing list
> R-sig-ecology@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
>

-- 
Gavin Simpson, PhD                          [t] +1 306 337 8863
Adjunct Professor, Department of Biology    [f] +1 306 337 2410
Institute of Environmental Change & Society [e] gavin.simp...@uregina.ca
523 Research and Innovation Centre          [tw] @ucfagls
University of Regina
Regina, SK S4S 0A2, Canada