Hi R-Users,

I have a student working with lionfish who has been trying to analyse a multivariate dataset to see which variables/factors influence lionfish behaviour. We have attempted a number of analyses, including rpart, relaimpo and standard linear regression, but we are not having much luck getting useful output. The data are very non-normal and we would appreciate some advice on the best way to go about analysing them.
Kathy has provided a synopsis below along with part of the dataset. Any help/advice appreciated.

I am stuck on a problem with a dataset from a behavior study on the Indo-Pacific lionfish *Pterois volitans*. The idea is to find out whether lionfish behave differently at different locations and times of day, and whether these differences can be accounted for by any of the explanatory variables measured.

My response variable is a series of behavior categories: (1) rest, (2) passive hunting and (3) active hunting. I have chosen to treat them individually because each one has a different biological importance, so basically I am trying to come up with an answer for 3 response variables. Each behavior is measured as the proportion of a 10-minute observation spent on that activity, so values range from 0 to 1.

There are six explanatory variables, a mix of categorical and continuous: Region (Guam and Philippines), Hours after Sunrise, Habitat (5 categories), Weather (3 categories), Current (3 categories) and Lionfish Size (cm).
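For anyone who wants to follow along in R, here is a minimal sketch of how the variables described above could be held in a data frame. It is only an illustration under a few assumptions: the object name `lion` is made up, the column abbreviations (R, REG, HAS, HAB, WE, CU, SI) are the ones used in the posted excerpt, and the three rows are copied from that excerpt rather than being the full 357 observations. The categorical variables are stored as factors here, which is one possible choice and not necessarily how they were coded in the analyses reported in this post.

```r
# Sketch only: three rows copied from the posted excerpt.
# 'lion' is a hypothetical object name; columns follow the post's abbreviations.
lion <- data.frame(
  R   = c(0.05, 0.10, 0.50),              # proportion of the 10-min observation spent resting
  REG = c(0, 0, 1),                       # region, coded 0/1 in the post
  HAS = c(11.0166667, 4.2, 11.5833333),   # hours after sunrise
  HAB = c("Artificial", "Sand_rubble", "Rock_boulder_cave"),  # habitat
  WE  = c(2, 1, 1),                       # weather, coded 0-2 in the post
  CU  = c(0, 2, 1),                       # current, coded 0-2 in the post
  SI  = c(10, 20, 29),                    # lionfish size (cm)
  stringsAsFactors = FALSE
)

# Store the categorical variables as factors (an assumption, see above).
cat_cols <- c("REG", "HAB", "WE", "CU")
lion[cat_cols] <- lapply(lion[cat_cols], factor)

# The response is a proportion, so it should lie in [0, 1].
stopifnot(all(lion$R >= 0 & lion$R <= 1))

str(lion)
```

With the full dataset the same steps would apply after reading the file (for example with `read.table(..., header = TRUE)`); the file name and layout are not given in the post, so they are left out here.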
The following is an example of the dataset for the response variable Rest (R):

R     REG  HAS         HAB                WE  CU  SI
0.05  0    11.0166667  Artificial         2   0   10
0.05  0    0.56666667  Rock_boulder_cave  1   1   11
0.05  0    9.13333333  Artificial         1   1   18
0.1   0    4.2         Sand_rubble        1   2   20
0.1   0    9.13333333  Rock_boulder_cave  1   2   10
0.1   0    9.6         Sand_rubble        0   0   7
0.1   0    0.78333333  Rock_boulder_cave  1   0   31
0.1   0    1.28333333  Artificial         1   0   20
0.1   0    10.8666667  Coral              1   0   22
0.15  0    10.4166667  Coral              0   1   27
0.2   0    3.46666667  Rock_boulder_cave  0   0   8
0.2   0    1.23333333  Rock_boulder_cave  1   0   25
0.45  1    11.6833333  Coral              2   0   15
0.5   1    11.0166667  Artificial         1   2   14
0.5   1    11.9166667  Artificial         0   0   14
0.5   1    9.53333333  Artificial         1   0   24
0.5   1    9.83333333  Artificial         1   0   15
0.5   1    11.5833333  Rock_boulder_cave  1   1   29
0.53  1    5.91666667  Coral              1   1   15
0.6   1    11.0166667  Artificial         1   2   17
0.6   1    9.78333333  Rock_boulder_cave  0   0   12
0.6   1    4.68333333  Sand_rubble        2   0   14
0.6   1    5.01666667  Rock_boulder_cave  2   0   16
0.6   1    3.18333333  Artificial         2   1   19
0.65  1    5.25        Coral              2   0   15
0.65  1    9.63333333  Sand_rubble        1   1   17

As you can see, I have converted the categorical variables region, current and weather to numerical: region because it can be expressed in binary form, and the other two because they represent a quantity. For habitat I have created a dummy variable based on deviation coding and introduced it as a variable in my model. Total sample size is 357; each sample is an observation at a particular time of day.

The histogram of my response variable is not normal. It has a bit of a U-shape with lots of 0s and 1s, which means the animal was either completely engaged in that activity during the 10 min. observation or didn't show it at all. I have tried a series of transformations to normalize it (log, log(x+1), ln, sqrt, fourth root) but have been unsuccessful.

What type of analyses have I tried?

(1) Regression trees. Categorical variables were used as categorical, without converting them to numerical.
This was coded with package rpart and is the preferred analysis because of its ease of interpretation. The response variable was untransformed and the distribution chosen was Poisson. The result was a tree with immediately increasing error (cp), which picked 0 splits as the best tree.

(2) Multiple regression. I tried using package relaimpo to rank the importance of the explanatory variables. I used different transformations and analyzed the residuals; in all cases I obtained a weird-looking set of residuals, with one portion normally distributed and another portion clustered to the side, giving the whole graph a clear trend (my guess is that these are all the 1s and 0s in the data). I also tried generalized linear models (glm) with package pscl (Poisson, negative binomial and zero-inflated negative binomial). In all cases the fit seemed adequate, but the variance explained was very small and the coefficients estimated for my EVs were very low. Any ideas???

Lastly, I have used Primer to analyze the response variable against each EV individually. That works well but limits my conclusions and doesn't allow me to account for variation in one of the EVs affecting the others.

I appreciate any help I can get,

--
Andrew Halford Ph.D
Associate Research Scientist
Marine Laboratory
University of Guam
Ph: +1 671 734 2948

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.