Hi R-Users,

I have a student working on lionfish who has been trying to analyse
a multivariate dataset to see which variables/factors influence the
behaviour of lionfish. We have attempted a number of analyses, including
rpart, relaimpo and standard linear regression, but we are not having much
luck getting good-quality output. The data are very non-normal and we would
appreciate some advice on the best way to go about analysing them.

Kathy has provided a synopsis below, along with part of the dataset.

Any help/advice appreciated.

  I am stuck on a problem with a dataset from a behavior study of Indo-Pacific
lionfish *Pterois volitans*. The idea is to find out whether lionfish behave
differently at different locations and times of day and whether these
differences can be accounted for by any of the explanatory variables
measured.
  My response variable is a series of behavior categories: (1) rest, (2)
passive hunting and (3) active hunting. I have chosen to treat them
individually because each one has a different biological importance, so
basically I am trying to come up with an answer for 3 response variables.
Each behavior category is measured as the proportion of time (within a
10-minute observation) spent on that activity, so values range from 0 to 1.
There are six explanatory variables, a mix of categorical and continuous:
Region (Guam and Philippines), Hours after Sunrise, Habitat (5
categories), Weather (3 categories), Current (3 categories) and Lionfish
Size (cm).

  The following is an example of the dataset for response variable Rest (R)

  R     REG  HAS          HAB                WE  CU  SI
  0.05  0    11.0166667   Artificial         2   0   10
  0.05  0    0.56666667   Rock_boulder_cave  1   1   11
  0.05  0    9.13333333   Artificial         1   1   18
  0.1   0    4.2          Sand_rubble        1   2   20
  0.1   0    9.13333333   Rock_boulder_cave  1   2   10
  0.1   0    9.6          Sand_rubble        0   0   7
  0.1   0    0.78333333   Rock_boulder_cave  1   0   31
  0.1   0    1.28333333   Artificial         1   0   20
  0.1   0    10.8666667   Coral              1   0   22
  0.15  0    10.4166667   Coral              0   1   27
  0.2   0    3.46666667   Rock_boulder_cave  0   0   8
  0.2   0    1.23333333   Rock_boulder_cave  1   0   25
  0.45  1    11.6833333   Coral              2   0   15
  0.5   1    11.0166667   Artificial         1   2   14
  0.5   1    11.9166667   Artificial         0   0   14
  0.5   1    9.53333333   Artificial         1   0   24
  0.5   1    9.83333333   Artificial         1   0   15
  0.5   1    11.5833333   Rock_boulder_cave  1   1   29
  0.53  1    5.91666667   Coral              1   1   15
  0.6   1    11.0166667   Artificial         1   2   17
  0.6   1    9.78333333   Rock_boulder_cave  0   0   12
  0.6   1    4.68333333   Sand_rubble        2   0   14
  0.6   1    5.01666667   Rock_boulder_cave  2   0   16
  0.6   1    3.18333333   Artificial         2   1   19
  0.65  1    5.25         Coral              2   0   15
  0.65  1    9.63333333   Sand_rubble        1   1   17

   As you can see here, I have converted the categorical variables region,
current and weather to numeric: region because it can be expressed in
binary form, and the other two because they represent a quantity. For habitat
I have created a dummy variable based on deviation coding and introduced it
as a variable in my model.
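
   For anyone wanting to reproduce this coding, here is a minimal sketch in
R using the first six rows of the excerpt. The data frame name `lion` is made
up for illustration, and reading "deviation coding" as sum-to-zero contrasts
(`contr.sum`) is an assumption. Only four of the five habitat categories
appear in the excerpt, so only those four are listed as levels here.

```r
# First six rows of the excerpt above; `lion` is an illustrative name.
lion <- data.frame(
  R   = c(0.05, 0.05, 0.05, 0.1, 0.1, 0.1),
  REG = c(0, 0, 0, 0, 0, 0),
  HAS = c(11.0166667, 0.56666667, 9.13333333, 4.2, 9.13333333, 9.6),
  HAB = c("Artificial", "Rock_boulder_cave", "Artificial",
          "Sand_rubble", "Rock_boulder_cave", "Sand_rubble"),
  WE  = c(2, 1, 1, 1, 1, 0),
  CU  = c(0, 1, 1, 2, 2, 0),
  SI  = c(10, 11, 18, 20, 10, 7)
)

# Habitat as a factor; levels are the four categories seen in the excerpt
# (the full dataset is described as having five).
lion$HAB <- factor(lion$HAB,
                   levels = c("Artificial", "Coral",
                              "Rock_boulder_cave", "Sand_rubble"))

# Deviation (sum-to-zero) coding: k levels give k - 1 columns, and the
# last level is coded -1 in every column.
contrasts(lion$HAB) <- contr.sum(nlevels(lion$HAB))

# Design matrix: intercept plus columns HAB1, HAB2, HAB3.
hab_dummies <- model.matrix(~ HAB, data = lion)
hab_dummies
```

Note that this coding produces one column fewer than the number of habitat
levels, so a five-level habitat would enter a model as four columns.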
   Total sample size is 357, where each sample is an observation at a
particular time of day. A histogram shows that my response variable is not
normally distributed: it has a bit of a U-shape with lots of 0s and 1s, which
means the animal was either completely engaged in that activity during the
10 min. observation or didn't show it at all. I have tried a series of
transformations to normalize it but have been unsuccessful (log, log(x+1), ln,
sqrt, fourth root).
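
   A small sketch of why the transformations above could not help: a
one-to-one transformation moves the spikes at 0 and 1 but cannot spread them
out, and log is not even defined at 0. The vector `y` below is made up to
mimic the U-shape described; it is not the lionfish data.

```r
set.seed(1)
# Made-up proportions, NOT the lionfish data: a spike at 0, a spike at 1,
# and some values strictly in between.
y <- c(rep(0, 40), rep(1, 40), runif(20, 0.05, 0.95))

# Share of observations sitting exactly on the lower spike, before and
# after each transformation.
zero_share_raw    <- mean(y == 0)
zero_share_sqrt   <- mean(sqrt(y) == 0)
zero_share_fourth <- mean(y^(1/4) == 0)
zero_share_log1p  <- mean(log1p(y) == 0)   # log(x + 1)

# Plain log sends every exact zero to -Inf.
n_infinite_log <- sum(is.infinite(log(y)))

c(raw = zero_share_raw, sqrt = zero_share_sqrt,
  fourth_root = zero_share_fourth, log1p = zero_share_log1p)
n_infinite_log
```

All four shares are identical, so a histogram of the transformed response
still shows the same pile of observations at its lower end.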
    What types of analysis have I tried?
(1) Regression trees.
     Using categorical variables as categorical, without converting them to
numerical. This was coded with package rpart and is the preferred analysis
due to ease of interpretation. The response variable was untransformed and
the distribution chosen was Poisson. The result was a tree with immediately
increasing error (cp), which picked 0 splits as the best tree.
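
     A hedged sketch of how the "0 splits" conclusion is read off an rpart
fit. The data frame `sim` is simulated with a deliberately strong habitat
effect and is not the lionfish data; `method = "poisson"` follows the
description above. The `xerror` column of the cp table holds the
cross-validated error, and the row where it is smallest gives the favoured
tree size.

```r
library(rpart)

set.seed(42)
# Simulated stand-in data, NOT the lionfish observations.
n <- 200
sim <- data.frame(
  HAB = factor(sample(c("Artificial", "Coral"), n, replace = TRUE)),
  SI  = round(runif(n, 7, 31))
)
# Proportion of time resting, made much higher on Coral so that the
# tree has a clear signal to find.
sim$R <- ifelse(sim$HAB == "Coral", runif(n, 0.6, 1), runif(n, 0, 0.2))

fit <- rpart(R ~ HAB + SI, data = sim, method = "poisson")

# cp table columns: CP, nsplit, rel error, xerror, xstd
cp_tab <- fit$cptable
best   <- which.min(cp_tab[, "xerror"])   # row with lowest cross-validated error
best_nsplit <- cp_tab[best, "nsplit"]     # 0 here would mean "root only"

cp_tab
best_nsplit
```

If `xerror` is already at its minimum in the first row (nsplit = 0) and rises
from there, cross-validation is saying that no split predicts better than the
overall mean, which matches the outcome described above.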

(2) Multiple regression
    Tried using package relaimpo to obtain a ranking of the
importance of explanatory variables. Used different transformations to
analyze residuals and in all cases obtained a weird-looking set of residuals,
with one portion normally distributed and another portion clustered to the
side, giving the whole graph a clear trend (my guess is these are all the 1s
and 0s in the data).
    I also tried non-linear regressions (glm) with package pscl (Poisson,
negative binomial and zero-inflated negative binomial). In all cases fit
seemed adequate but variance explained was very small and the coefficients
estimated for my EVs were very low.
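
    The guess about the residuals can be checked with a short sketch. In a
linear model, every observation with a response of exactly 0 has a residual
equal to minus its fitted value, and every observation with a response of
exactly 1 has a residual equal to one minus its fitted value, so both groups
fall on straight sloping lines in a residuals-versus-fitted plot. The data
below are simulated, not the lionfish observations, and the variable names
are illustrative.

```r
set.seed(7)
# Simulated stand-in data, NOT the lionfish observations.
n <- 150
x <- runif(n, 0, 12)                        # e.g. hours after sunrise
y <- c(rep(0, 50), rep(1, 50), runif(50))   # many exact 0s and 1s

fit <- lm(y ~ x)
res <- resid(fit)
fv  <- fitted(fit)

# Residuals of the exact 0s and exact 1s are fully determined by the
# fitted values, which is what produces the trended bands.
zeros_on_line <- isTRUE(all.equal(unname(res[y == 0]), unname(-fv[y == 0])))
ones_on_line  <- isTRUE(all.equal(unname(res[y == 1]), unname(1 - fv[y == 1])))

c(zeros_on_line = zeros_on_line, ones_on_line = ones_on_line)

# To see the pattern, uncomment:
# plot(fv, res, xlab = "Fitted", ylab = "Residual")
```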

   Any ideas??? Lastly, I have used Primer to analyze the response variable
against each EV individually. That works well but limits my
conclusions and doesn't allow me to account for variation in one of the EVs
affecting others. I appreciate any help I can get,

-- 
Andrew Halford Ph.D
Associate Research Scientist
Marine Laboratory
University of Guam
Ph: +1 671 734 2948


