Re: [R] significance testing for the difference in the ratio of means
groups of size 70 and 72 to represent the two ratios and compute the difference in means. This distribution could represent the distribution under the null hypothesis, and I could then locate my observed value on it to compute the p-value. This makes me uncomfortable, however, as it seems to treat the data as a mean of ratios rather than a ratio of means.

Method 4: Combination of bootstrap and permutation test. Sample with replacement samples of size 7, 10, 8, 9 from m1, m2, m3, m4 respectively, as in Method 2 above. Calculate the two ratios for these 4 samples (m2/m1 and m4/m3) and record them in a list. Repeat this an arbitrary number (B) of times, recording the two ratios each time; hence if B = 10, we will have 20 observations of the ratios. Then proceed with permutation testing on these 20 ratio observations by repeatedly randomizing them into two equal groups of 10 and computing the difference in means of the two groups, as in Method 3 above. This could potentially yield a distribution under the null hypothesis, and p-values could be obtained by locating the observed value on this distribution. I am unsure of appropriate values for B, or whether this method is valid at all.

Another complication is the concern about multiple comparisons if I wished to include additional test groups (m5 = testgroup2; m6 = testgroup2 w/ treatment; m7 = testgroup3; m8 = testgroup3 w/ treatment; etc.) and how that might appropriately be handled.

Method 2 seems the most intuitive to me. Bootstrapping this way will likely yield appropriate standard errors for the two ratios. However, I am very much interested in appropriate p-values for the comparison, and I am not sure whether locating 0 on the bootstrap distribution of the difference of means is appropriate.

Thank you in advance for your suggestions.

-Rahul
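A minimal R sketch of the resampling in Methods 2 and 4 as described above, assuming m1, m2, m3, m4 are the four numeric vectors mentioned earlier in the post (sizes 7, 10, 8, 9); the percentile interval at the end is one hedged way to judge whether 0 is plausible:

set.seed(1)
B <- 10000
boot.diff <- replicate(B, {
  r1 <- mean(sample(m2, replace = TRUE)) / mean(sample(m1, replace = TRUE))  # ratio of means
  r2 <- mean(sample(m4, replace = TRUE)) / mean(sample(m3, replace = TRUE))
  r1 - r2
})
sd(boot.diff)                          # bootstrap SE of the difference of the two ratios
quantile(boot.diff, c(0.025, 0.975))   # percentile interval; does it cover 0?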
[R] Minimum number of resamples required to do BCa bootstrap?
I'm using R 2.15.1 on a 64-bit machine with Windows 7 Home Premium and package 'boot'. I've found that using a number of bootstrap resamples in boot() that is less than the number of data results in a fatal error. Once the number of resamples meets or exceeds the number of data, the error disappears.

Sample problem (the screwy subscripted syntax is a relic of editing down a more complex script):

> N <- 25
> s <- rlnorm(N, 0, 1)
> require(boot)
Loading required package: boot
> v <- NULL    # hold sample variance estimates
> i <- 1
> v[i] <- var(s)    # get sample variance
> nReal <- 10
> varf <- function (x, i) { var(x[i]) }
> fabc <- function (x, w) {  # weighted average (biased) variance
+   sum(x^2 * w) / sum(w) - (sum(x * w) / sum(w))^2
+ }
> p <- c(.25, .75, .2, .8, .15, .85, .1, .9, .05, .95, .025, .975, .005, .995)
> cl <- c(0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99)
> b <- boot(s, varf, R = nReal)   # bootstrap
> bv <- NULL     # hold bootstrap mean variance estimates
> bias <- NULL   # hold bias estimates
> bv[i] <- mean(b$t)        # bootstrap mean variance
> bias[i] <- bv[i] - v[i]   # bias estimate
> bCI90 <- boot.ci(b, conf = 0.90)
Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o,  :
  estimated adjustment 'a' is NA
In addition: Warning messages:
1: In norm.inter(t, (1 + c(conf, -conf))/2) :
  extreme order statistics used as endpoints
2: In boot.ci(b, conf = 0.9) : bootstrap variances needed for studentized intervals
3: In norm.inter(t, alpha) : extreme order statistics used as endpoints
> nReal <- 25
> b <- boot(s, varf, R = nReal)   # bootstrap
> bv[i] <- mean(b$t)        # bootstrap mean variance
> bias[i] <- bv[i] - v[i]   # bias estimate
> bCI90 <- boot.ci(b, conf = 0.90)
Warning messages:
1: In boot.ci(b, conf = 0.9) : bootstrap variances needed for studentized intervals
2: In norm.inter(t, adj.alpha) : extreme order statistics used as endpoints

The problem is that doing 10 resamples generates an NA in the estimation of the 'acceleration' in the function abc.ci(), but doing 25 resamples does not. This implies a connection between the number of resamples and the 'acceleration' which should not exist. ('Acceleration' should be obtained from the original sample via jackknife as 1/6 the coefficient of skewness.)

The script apparently works correctly if the number of resamples equals or exceeds the number of original data, but not otherwise.
[R] Possible error in BCa method for confidence intervals in package 'boot'
I'm using R 2.15.1 on a 64-bit machine with Windows 7 Home Premium. The sample problem and console session are the same as in my previous message ("Minimum number of resamples required to do BCa bootstrap?"): with nReal = 10 resamples on N = 25 data, boot.ci(b, conf = 0.90) fails with

Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o,  :
  estimated adjustment 'a' is NA

while with nReal = 25 resamples only warnings are issued.

The problem is that doing 10 resamples generates an NA in the estimation of the 'acceleration' in the function abc.ci(), but doing 25 resamples does not. This implies a connection between the number of resamples and the 'acceleration' which should not exist. ('Acceleration' should be obtained from the original sample via jackknife as 1/6 the coefficient of skewness.)

Looking at the script for abc.ci(), there is an anomalous reference to 'n' in the invocation line, yet 'n' is not an argument, so it must be defined more globally before the call. Yet 'n' is defined within the script as the length of 'data', which is referred to as the 'bootstrap' vector in the comments but should be the original sample data. This confusion, plus the use of an argument 'eps' with a default of 0.001/n in the calculations, makes me suspect the programming in the script.

The script apparently works correctly if the number of resamples equals or exceeds the number of original data, but not otherwise.
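For what it's worth, a sketch of the jackknife-based acceleration described in the two posts above, for the variance statistic and the sample s from the sample problem; computed this way it depends only on the original data, not on the number of bootstrap resamples:

theta.jack <- sapply(seq_along(s), function(i) var(s[-i]))  # leave-one-out variances
d <- mean(theta.jack) - theta.jack
a <- sum(d^3) / (6 * sum(d^2)^1.5)   # standard BCa acceleration estimate
a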
Re: [R] coxph() - unexpected result using Crawley's seedlings data (The R Book)
Did you create the 'status' variable the way indicated on p. 797? With Surv() it frequently pays to use syntax such as Surv(death, status==1) to make a clear logical statement of what is an event (status==1) vs. censored.

PS. Next time include head(seedlings) and str(seedlings) to make clear what you are using as data.

At 06:51 AM 6/28/2011, Jacob Brogren wrote:

Hi,

I ran the example on pp. 799-800 from Michael Crawley's The R Book using package survival v. 2.36-5, R 2.13.0 and RStudio 0.94.83. The model is a Cox proportional hazards model. The result was quite different compared to the R Book. I have compared my code to the code in the book but cannot find any differences in the function call. My results are given below, as well as a link to the results presented in the book (link to Google Books). When running the examples on pp. 797-799 I can't detect any differences in results, so I don't think there are errors in the data set or in the creation of the status variable.

- Original from the R Book:
http://books.google.com/books?id=8D4HVx0apZQClpg=PA799ots=rQgd_8ofeSdq=r%20coxph%20crawleypg=PA799#v=onepageqf=false

- My result:

> summary(model1)
Call:
coxph(formula = Surv(death, status) ~ strata(cohort) * gapsize, data = seedlings)

  n= 60, number of events= 60

                                             coef exp(coef) se(coef)      z Pr(>|z|)
gapsize                                 -0.001893  0.998109 0.593372 -0.003    0.997
gapsize:strata(cohort)cohort=September   0.717407  2.049112 0.860807  0.833    0.405

                                       exp(coef) exp(-coef) lower .95 upper .95
gapsize                                   0.9981      1.002    0.3120     3.193
gapsize:strata(cohort)cohort=September    2.0491      0.488    0.3792    11.074

Rsquare= 0.022   (max possible= 0.993 )
Likelihood ratio test= 1.35  on 2 df,   p=0.5097
Wald test            = 1.32  on 2 df,   p=0.5178
Score (logrank) test = 1.33  on 2 df,   p=0.514

Anyone have an idea why this is occurring?

Kind Regards
Jacob
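A sketch of the Surv() advice above, assuming 'status' was built as on p. 797 with 1 meaning the seedling died (the "60 events out of 60" in the output suggests no censoring):

library(survival)
model1 <- coxph(Surv(death, status == 1) ~ strata(cohort) * gapsize, data = seedlings)
summary(model1)
head(seedlings); str(seedlings)   # as suggested, include these with the post next time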
Re: [R] analysing a three level reponse
The best method would probably be proportional odds regression using polr() in 'MASS'. At least it's a starting point.

At 09:19 AM 6/22/2011, Matt Ellis (Research) wrote:

Hello,

I am struggling to figure out how to analyse a dataset I have inherited (please note this was conducted some time ago, so the data is as it is, and I know it isn't perfect!). A brief description of the experiment follows:

Pots of grass were grown in 1 l pots of standard potting medium for 1 month with a regular light and watering regime. At this point they were randomly given 1 l of one of 4 different pesticides at one of 4 different concentrations (100%, 75%, 50% or 25% in water). There were 20 pots of grass for each pesticide/concentration combination, giving 320 pots. There were no control (untreated) pots. The response was measured after 1 week and recorded as either:

B1 - grass dead
B2 - grass affected but not dead
B3 - no visible effect

I could analyse this as a lethal effect vs non-lethal effect (B1 vs B2+B3) or some effect vs no effect (B1+B2 vs B3) binomial model, but I can't see how to do it with three levels.

Any pointing in the right direction greatly appreciated!

Thanks
Matt
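A rough sketch of the polr() suggestion; the data frame and column names here are hypothetical, and the response is treated as ordered (no effect < affected < dead):

library(MASS)
grass$outcome <- factor(grass$outcome, levels = c("B3", "B2", "B1"), ordered = TRUE)
fit <- polr(outcome ~ pesticide * conc, data = grass, Hess = TRUE)
summary(fit)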
Re: [R] Using MCMC sampling to estimate p values with a mixed model
If your alternative hypothesis is unequal variances (2-sided), both F > 1 and F < 1 are of interest, and rejection of the equal-variance null can occur on either side. The usual ANOVA F test is 1-sided, with the alternative that the numerator variance exceeds the denominator one, so this is perhaps why you are confused.

At 02:40 PM 6/18/2011, Woodcock, Helena wrote:

Hi everyone,

Apologies if this is a silly question, but I am a student and this is my first time using R, so I am still trying to educate myself on commands, models, etc.

I have a mixed model with four dichotomous fixed factors and subject as a random factor (as each person completed four vignettes, with factors crossed across vignettes). I have run an lmer model and used the Monte Carlo method to see if there are any significant main effects or interactions.

However, when I looked at the p values, some are showing as significant although the F value is less than 1. Is it possible to have a significant effect with an F value below 1? I have a sample size of 150 and have read that the pMCMC values can be anti-conservative, so I wonder if it is because my sample size may be too small.

Thank you for any help
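A toy illustration of the point about two-sided tests (not the poster's mixed model): a two-sided variance-ratio test can be significant even though the F statistic is below 1.

set.seed(42)
x <- rnorm(50, sd = 1)
y <- rnorm(50, sd = 2)
var.test(x, y)   # F = var(x)/var(y) is well below 1, yet the two-sided p-value is small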
[R] How to convert a factor column into a numeric one?
I have a data frame:

> head(df)
  Time Temp Conc Repl    Log10
1    0  -20    H    1 6.406547
2    2  -20    H    1 5.738683
3    7  -20    H    1 5.796394
4   14  -20    H    1 4.413691
5    0    4    H    1 6.406547
7    7    4    H    1 5.705433
> str(df)
'data.frame':   177 obs. of  5 variables:
 $ Time : Factor w/ 4 levels "0","2","7","14": 1 2 3 4 1 3 4 1 3 4 ...
 $ Temp : Factor w/ 4 levels "-20","4","25",..: 1 1 1 1 2 2 2 3 3 3 ...
 $ Conc : Factor w/ 3 levels "H","L","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Repl : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Log10: num  6.41 5.74 5.8 4.41 6.41 ...
> levels(df$Temp)
[1] "-20" "4"   "25"  "45"
> levels(df$Time)
[1] "0"  "2"  "7"  "14"

As you can see, Time and Temp are currently factors, not numeric. I would like to change these columns into numeric ones.

df$Time <- as.numeric(df$Time)

doesn't work, as it changes the values to the factor level indices (1,2,3,4) instead of the values (0,2,7,14). There must be a direct way of doing this in R.

I tried recode() in 'car':

> df$Temp <- recode(df$Temp, '1=-20;2=25;3=4;4=45', as.factor.result=FALSE)
> head(df)
  Time Temp Conc Repl     Freq
1    0  -20    H    1 6.406547
2    2  -20    H    1 5.738683
3    7  -20    H    1 5.796394
4   14  -20    H    1 4.413691
5    0   45    H    1 6.406547
7    7   45    H    1 5.705433

but note that the values for 'Temp' in rows 5 and 7 are 45 and not 4, as expected, although the result is numeric. The same happens if I use the order given by levels(df$Temp) instead of the sort order in the recode() 2nd argument.

Any hints?
Re: [R] How to convert a factor column into a numeric one?
Exactly! Thanks.

At 12:49 AM 6/5/2011, Jorge Ivan Velez wrote:

Dr. LaBudde,

Perhaps as.numeric(as.character(x)) is what you are looking for.

HTH,
Jorge

On Sun, Jun 5, 2011 at 12:31 AM, Robert A. LaBudde wrote:
[original post quoted above - snipped]
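Applied to the data frame from the original post, the suggestion amounts to:

df$Time <- as.numeric(as.character(df$Time))
df$Temp <- as.numeric(as.character(df$Temp))
str(df)   # Time and Temp are now numeric, with values 0/2/7/14 and -20/4/25/45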
Re: [R] How to convert a factor column into a numeric one?
Thanks for your help.

As far as your question below is concerned, the data frame arose as a result of some data cleaning on an original data frame, which was changed into a table, modified, and changed back to a data frame:

ttcrmean <- as.table(by(ngbe[,'Log10'],
    list(Time=ngbe$Time, Temp=ngbe$Temp, Conc=ngbe$Conc, Repl=ngbe$Replicate), mean))
for (k in 1:3) {      # fix up time zeroes
  for (l in 1:5) {    # replicates
    t0val <- ttcrmean[1,3,k,l]
    for (j in 1:4) {  # temps
      ttcrmean[1,j,k,l] <- t0val
    } #j
  } #l
} #k
df <- na.omit(as.data.frame(ttcrmean))
colnames(df)[5] <- 'Log10'

At 12:51 AM 6/5/2011, Joshua Wiley wrote:

Hi Robert,

[snip]

I would also look into *why* those numeric columns are being stored as factors in the first place. If you are reading the data in with read.table() or one of its wrapper functions (like read.csv), then it would be better to preempt the storage as a factor altogether rather than converting back to numeric. For example, perhaps something is being used to indicate missing data that R does not recognize (e.g., SAS uses "."). Specifying na.strings = "." would fix this. See ?read.table for some of the options available.

[snip]
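A sketch of the read-time fix suggested above; the file name and separator here are hypothetical:

ngbe <- read.table("ngbe.txt", header = TRUE, sep = "\t", na.strings = ".")
str(ngbe)   # numeric columns should now be read as numeric rather than factor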
Re: [R] How to convert a factor column into a numeric one?
Thanks! Exactly what I wanted, and the same as Jorge also suggested.

At 12:49 AM 6/5/2011, Dennis Murphy wrote:

Hi:

Try this:

> dd <- data.frame(a = factor(rep(1:5, each = 4)),
+                  b = factor(rep(rep(1:2, each = 2), 5)),
+                  y = rnorm(20))
> str(dd)
'data.frame':   20 obs. of  3 variables:
 $ a: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ b: Factor w/ 2 levels "1","2": 1 1 2 2 1 1 2 2 1 1 ...
 $ y: num  0.6396 1.467 1.8403 -0.0915 0.2711 ...
> de <- within(dd, {
+   a <- as.numeric(as.character(a))
+   b <- as.numeric(as.character(b))
+ })
> str(de)
'data.frame':   20 obs. of  3 variables:
 $ a: num  1 1 1 1 2 2 2 2 3 3 ...
 $ b: num  1 1 2 2 1 1 2 2 1 1 ...
 $ y: num  0.6396 1.467 1.8403 -0.0915 0.2711 ...

HTH,
Dennis

On Sat, Jun 4, 2011 at 9:31 PM, Robert A. LaBudde r...@lcfltd.com wrote:
[original post quoted above - snipped]
Re: [R] Finding solution set of system of linear equations.
solve() only works for nonsingular systems of equations. Use a generalized inverse for singular systems:

> A <- matrix(c(1,2,1,1, 3,0,0,4, 1,-4,-2,-2, 0,0,0,0), ncol=4, byrow=TRUE)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    2    1    1
[2,]    3    0    0    4
[3,]    1   -4   -2   -2
[4,]    0    0    0    0
> b <- c(0,2,2,0)   # rhs
> b
[1] 0 2 2 0
> require('MASS')
> giA <- ginv(A)   # M-P generalized inverse
> giA
       [,1]          [,2]     [,3] [,4]
[1,]  0.667  1.431553e-16  0.333   0.0
[2,]  0.333 -1.00e-01     -0.0333  0.0
[3,]  0.167 -5.00e-02     -0.0167  0.0
[4,] -0.500  2.50e-01     -0.2500  0.0
> require('Matrix')
> I <- as.matrix(Diagonal(4))   # order-4 identity matrix
> I
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
> giA %*% b   # particular solution
              [,1]
[1,]  6.67e-01
[2,] -2.67e-01
[3,] -1.33e-01
[4,] -2.220446e-16
> giA %*% A - I   # matrix for parametric homogeneous solution
              [,1] [,2] [,3]          [,4]
[1,]  0.000000e+00  0.0  0.0  5.551115e-16
[2,]  3.469447e-17 -0.2  0.4  4.024558e-16
[3,]  4.510281e-17  0.4 -0.8  2.706169e-16
[4,] -3.330669e-16  0.0  0.0 -7.771561e-16

At 09:34 PM 5/21/2011, dslowik wrote:

I have a simple system of linear equations to solve for X, aX = b:

> a
     [,1] [,2] [,3] [,4]
[1,]    1    2    1    1
[2,]    3    0    0    4
[3,]    1   -4   -2   -2
[4,]    0    0    0    0
> b
     [,1]
[1,]    0
[2,]    2
[3,]    2
[4,]    0

(This is ex Ch1, 2.2 of Artin, Algebra.) So, 3 eqs in 4 unknowns. One can easily use row reductions to find a homogeneous solution (b=0) of:

X_1 = 0, X_2 = -c/2, X_3 = c, X_4 = 0

and solutions of the above system are:

X_1 = 2/3, X_2 = -1/3 - c/2, X_3 = c, X_4 = 0.

So the kernel is 1-D, spanned by X_2 = -X_3/2 (nullity = 1), and the rank is 3.

In R I use solve():

> solve(a, b)
Error in solve.default(a, b) : Lapack routine dgesv: system is exactly singular

and it gives the error that the system is exactly singular, since it seems to be trying to invert a.

So my question is: Can R only solve non-singular linear systems? If not, what routine should I be using? If so, why? It seems that it would be simple and useful enough to have a routine which, given a system as above, returns the null-space (kernel) and the particular solution.
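For completeness, a sketch of another way to get the kernel and to check the particular solution, using the A, b and giA objects defined above:

sv <- svd(A)
nullvec <- sv$v[, sv$d < 1e-10, drop = FALSE]   # basis for {x : A x = 0}
nullvec / nullvec[3]                            # rescaled: (0, -1/2, 1, 0)
xp <- giA %*% b                                 # particular solution from ginv()
A %*% xp - b                                    # ~ 0, so xp solves A x = b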
Re: [R] problem with glm(family=binomial) when some levels have only 0 proportion values
The algorithm is not converging; your iterations are at the maximum.

It won't do any good to add a fractional number to all data, as the result will depend on the number added (try 1.0, 0.5 and 0.1 to see this).

The root problem is that your data are degenerate. Firstly, your types '2' and '3' are indistinguishable in your data. Secondly, consider the case without 'type': if you have all-zero data for 10 trials, you cannot discriminate among mu = 0, 0.1, 0.0001, 0.001 or 0.01. This leads to numerical instability. Thirdly, the variance estimate in the IRLS will start at 0.0, which gives a singularity.

Fundamentally, the algorithm is failing because you are at the boundary of possibilities for a parameter, so special techniques are needed to do maximum likelihood estimation.

The simple solution is to deal with the data for your types separately. Another is to do more batches for '2' and '3' to get an observed failure.

At 05:01 AM 3/2/2011, Jürg Schulze wrote:

Hello everybody

I want to compare the proportions of germinated seeds (seed batches of size 10) of three plant types (1,2,3) with a glm with binomial data (following the method in Crawley: Statistics, an introduction using R, p. 247). The problem seems to be that in two plant types (2,3) all plants have proportions = 0. I give you my data and the model I'm running:

      success failure type
 [1,]       0      10    3
 [2,]       0      10    2
 [3,]       0      10    2
 [4,]       0      10    2
 [5,]       0      10    2
 [6,]       0      10    2
 [7,]       0      10    2
 [8,]       4       6    1
 [9,]       4       6    1
[10,]       3       7    1
[11,]       5       5    1
[12,]       7       3    1
[13,]       4       6    1
[14,]       0      10    3
[15,]       0      10    3
[16,]       0      10    3
[17,]       0      10    3
[18,]       0      10    3
[19,]       0      10    3
[20,]       0      10    2
[21,]       0      10    2
[22,]       0      10    2
[23,]       9       1    1
[24,]       6       4    1
[25,]       4       6    1
[26,]       0      10    3
[27,]       0      10    3

y <- cbind(success, failure)

Call:
glm(formula = y ~ type, family = binomial)

Deviance Residuals:
       Min         1Q     Median         3Q        Max
-1.3521849     -0.427     -0.427     -0.427  2.6477556

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.04445    0.21087   0.211    0.833
typeFxC      -23.16283 6696.13233  -0.003    0.997
typeFxD      -23.16283 6696.13233  -0.003    0.997

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 134.395  on 26  degrees of freedom
Residual deviance:  12.622  on 24  degrees of freedom
AIC: 42.437

Number of Fisher Scoring iterations: 20

Huge standard errors are calculated and there is no difference between plant types 1 and 2 or between plant types 1 and 3.

If I add 1 to all successes, so that all the 0 values disappear, the standard errors become lower and I find highly significant differences between the plant types.

suc <- success + 1
fail <- 11 - suc
Y <- cbind(suc, fail)

Call:
glm(formula = Y ~ type, family = binomial)

Deviance Residuals:
        Min          1Q      Median          3Q         Max
-1.279e+00  -4.712e-08  -4.712e-08   0.000e+00   2.584e+00

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.2231     0.2023   1.103     0.27
typeFxC      -2.5257     0.4039  -6.253 4.02e-10 ***
typeFxD      -2.5257     0.4039  -6.253 4.02e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 86.391  on 26  degrees of freedom
Residual deviance: 11.793  on 24  degrees of freedom
AIC: 76.77

Number of Fisher Scoring iterations: 4

So I think the 0 values of all plants of groups 2 and 3 are the problem, do you agree? I don't know why this is a problem, or how I can explain to a reviewer why a data transformation (+1) is necessary with such a dataset.

I would greatly appreciate any comments.

Juerg
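A sketch of the "deal with the types separately" suggestion, assuming success, failure and type are vectors laid out as printed above. Type 1 has events, so an exact binomial CI is direct; types 2 and 3 have zero successes in all their batches, but an exact 95% upper bound on the germination probability is still available:

binom.test(sum(success[type == 1]),
           sum(success[type == 1] + failure[type == 1]))
binom.test(0, sum(success[type == 2] + failure[type == 2]))$conf.int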
Re: [R] [R-sig-ME] Fwd: Re: ANOVA and Pseudoreplication in R
, either Dettol or Garlic, and press them and incubate them. Then, when the zones have appeared after a day or 2, I take 4 diameter measurements from each zone, across the zone at different angles, to account for the fact that a zone may have a weird shape, or not be quite circular.

I'm concerned about pseudoreplication, such as the multiple readings from one disk, and the 5 lineages - which might be different from one another in each of the two EV. treatments, but not with NE. treatments. I read that I can remove pseudoreplication from the multiple readings on each disk by using the 4 readings on each disk to produce a mean for the disk, and analysing those means - exercising caution where there are extreme values. I think the 3 disks for each lineage are not themselves pseudoreplication, because they are genuinely 3 disks on a plate: the disk diffusion test replicated 3 times - but the multiple readings from one disk, I feel, are pseudoreplication. I've also read about including Error() terms in a formula. I'm unsure whether the two NE. treatments coming from the same culture introduce pseudoreplication at the Treatment factor level, given that the two different antimicrobials used have two different effects.

I was hoping for a more expert opinion on whether I have identified the pseudoreplication correctly, or whether there is indeed pseudoreplication in the 5 lineages or anywhere else I haven't seen, and how best this is dealt with in R. At the minute my solution to the multiple readings from one disk is simply to make a new factor with the disk means and do the ANOVA on that, or even take the means before I load the dataset into R. I'm wondering if an Error() term would be correct.

Thanks,
Ben W.
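A heavily hedged sketch of the disk-mean idea combined with an Error() term; every object and column name here is hypothetical (zones: one row per reading, with Treatment, Lineage, Disk and Diameter columns), and the error structure would need to match how lineages relate to treatments in the actual design:

diskmeans <- aggregate(Diameter ~ Treatment + Lineage + Disk, data = zones, FUN = mean)
fit <- aov(Diameter ~ Treatment + Error(Lineage), data = diskmeans)
summary(fit)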
Re: [R] comparing proportions
prop.test() is applicable to a binomial experiment in each of two classes. Your experiment is binomial only at the subject level; you then have multiple subjects in each of your groups. You have a random factor, Subject, that must be accounted for.

The best way to analyze is a generalized linear mixed model with a binomial distribution family and a logit or probit link. You will probably have to investigate overdispersion.

If you have a small number of subjects, and don't care about the among-subject effect, you can model them as fixed effects and use glm() instead.

Your original question, I believe, related to doing an ANOVA assuming normality. In order for this to work with this kind of proportion problem, you generally won't get good results unless the number of replicates per subject is 12 or more, and the proportions involved are within 0.15 to 0.85. Otherwise you will have biased confidence intervals and significance tests.

At 07:51 PM 2/9/2011, array chip wrote:

Hi Bert,

Thanks for your reply. If I understand correctly, prop.test() is not suitable for my situation. The input to prop.test() is 2 numbers for each group (# of successes and # of trials); for example, group 1 has 5 successes out of 10 trials, group 2 has 3 successes out of 7 trials, etc. prop.test() tests whether the probability of success is the same across groups.

In my case, each group has several subjects and each subject has 2 numbers (# successes and # trials). So

for group 1:
  subject 1: 5 successes, 10 trials
  subject 2: 3 successes, 8 trials
  :
for group 2:
  subject a: 7 successes, 9 trials
  subject b: 6 successes, 7 trials
  :

I want to test whether the probability of success in group 1 is the same as in group 2. It's like comparing 2 groups of samples using a t test; what I am uncertain about is whether a regular t test (or non-parametric test) is still appropriate here when the response variable is actually a proportion. I guess prop.test() cannot be used with my dataset, or I may be wrong?

Thanks

John

From: Bert Gunter gunter.ber...@gene.com
Sent: Wed, February 9, 2011 3:58:05 PM
Subject: Re: [R] comparing proportions

1. Is this a homework problem?
2. ?prop.test
3. If you haven't done so already, get and consult a basic statistical methods book to help you with questions such as this.

-- Bert

Hi, I have a dataset that has 2 groups of samples. For each sample, the response measured is the number of successes (no.success) obtained with the number of trials (no.trials). So a proportion of success (prop.success) can be computed as no.success/no.trials. Now the objective is to test if there is a statistically significant difference in the proportion of success between the 2 groups of samples (say n1=20, n2=30). I can think of 3 ways to do the test:

1. a regular t test based on the variable prop.success
2. a Mann-Whitney test based on the variable prop.success
3. a binomial regression:
   fit <- glm(cbind(no.success, no.trials-no.success) ~ group, data=data, family=binomial)
   anova(fit, test='Chisq')

My questions are:
1. Is a t test appropriate for comparing 2 groups of proportions?
2. How about a Mann-Whitney non-parametric test?
3. Among the 3, which technique is more appropriate?
4. Any other technique you can suggest?
Thank you,
John
Re: [R] comparing proportions
1. If you use a random effects model, you should make Subject the random factor, i.e., a random intercepts model with (1|Subject). Group is a fixed effect: you have only 2 groups. Even if you had more than 2 groups, treating Group as random would return a standard deviation, not a P-value as you wanted. Finally, I doubt you believe the groups used are meaningless, and only the population of groups is of interest; instead you consider them special, so Group is a fixed effect.

2. The number of observations for each Subject is the number of trials, which you previously indicated was 7 to 10 in the cases listed.

3. If you have no interest in the Subject effect, you can use a fixed Subject factor instead with glm() rather than glmer() or another mixed-model function. This is a good idea so long as the number of subjects is, say, less than 10. Otherwise a mixed model is a better idea.

I suggest you fit all three models to learn about what you're doing: 1) glmer() or equivalent, with cbind(successes, failures) ~ Group + (1|Subject); 2) glm() with cbind(successes, failures) ~ Subject + Group; and 3) lm(p ~ Subject + Group), where p is the proportion of successes for a particular subject and group. Then compare the results.

They will probably all 3 give the same conclusion on the hypothesis question about Group. I would guess the glmer() P-value will be largest, then the glm(), and finally the lm(), but the last two may reverse. The lm() model may actually perform fairly well, as the Edgeworth series converges rapidly to normal for binomial distributions with p within 0.15 to 0.85 and 10+ replicates, as I stated before.

I'd be interested in seeing the results of these 3 fits myself, just out of curiosity.

At 01:21 PM 2/10/2011, array chip wrote:

Robert and Bert, thank you both very much for the response, really appreciated. I agree that using regular ANOVA (or a regular t test) may not be wise given the normality issue. So I am debating between a generalized linear model using glm(..., family=binomial) and a generalized linear mixed effects model using glmer(..., family=binomial). I will forward to Robert an off-list email I sent to Bert about whether to use (1|subject) versus (1|group) in the mixed model specification. If using (1|group), both models will give me the same test for the fixed effects, which is what I am mainly interested in. So do I really need a mixed model here?

Thanks again

John

From: Bert Gunter gunter.ber...@gene.com
To: Robert A LaBudde r...@lcfltd.com
Cc: array chip arrayprof...@yahoo.com
Sent: Thu, February 10, 2011 10:04:06 AM
Subject: Re: [R] comparing proportions

Robert:

Yes, exactly. In an off-list email exchange, he clarified this for me, and I suggested exactly what you did, also with the caution that his initial ad hoc suggestions were unwise. His subsequent posts to R-help and the sig-mixed-models lists were the result, although he appears to have specified the model incorrectly in his glmer function (as (1|Group) instead of (1|subject)).

Cheers,
Bert

On Thu, Feb 10, 2011 at 9:55 AM, Robert A LaBudde r...@lcfltd.com wrote:
[LaBudde's earlier reply and the original question, quoted above, snipped]
Re: [R] comparing proportions
You need to change models 2 and 3 to use ~ group + subject. You left subject out as a fixed factor.

At 05:17 PM 2/10/2011, array chip wrote:

Robert, thank you! I tried all 3 models you suggested. Since each subject only has one line of data in my dataset, would including Subject as a factor in glm() or lm() lead to 0 df for residuals? Attached is my dataset, and here is my version of the 3 models:

test <- read.table("test.txt", sep='\t', header=T)
library(lme4)

## model 1: generalized linear mixed model
fit1 <- glmer(cbind(success,failure) ~ group + (1|subject), data=test, family=binomial)

## model 2: generalized linear model
fit2 <- glm(cbind(success,failure) ~ group, data=test, family=binomial)

## model 3: linear model
fit3 <- lm(prop.success ~ group, weights=success+failure, data=test)

> fit1
Generalized linear mixed model fit by the Laplace approximation
Formula: cbind(success, failure) ~ group + (1 | subject)
   Data: test
   AIC   BIC logLik deviance
 54.75 57.89 -24.38    48.75
Random effects:
 Groups  Name        Variance Std.Dev.
 subject (Intercept) 1.8256   1.3511
Number of obs: 21, groups: subject, 21

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.6785     0.4950  -1.371    0.170
groupgroup2  -0.7030     0.6974  -1.008    0.313

> summary(fit2)

Call:
glm(formula = cbind(success, failure) ~ group, family = binomial, data = test)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8204  -2.0789  -0.5407   1.0403   4.0539

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.4400     0.2080  -2.115   0.0344 *
groupgroup2  -0.6587     0.3108  -2.119   0.0341 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 78.723  on 20  degrees of freedom
Residual deviance: 74.152  on 19  degrees of freedom

> summary(fit3)

Call:
lm(formula = prop.success ~ group, data = test, weights = success + failure)

Residuals:
    Min      1Q  Median      3Q     Max
-1.1080 -0.7071 -0.2261  0.4754  1.9157

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.39175    0.08809   4.447 0.000276 ***
groupgroup2 -0.14175    0.12364  -1.146 0.265838
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8676 on 19 degrees of freedom
Multiple R-squared: 0.0647,     Adjusted R-squared: 0.01548
F-statistic: 1.314 on 1 and 19 DF,  p-value: 0.2658

As you can see, the 3 models gave quite different results: model 2 (generalized linear model) gave a significant p value, while model 1 (generalized mixed model) and model 3 (regular linear model) did not. Also, as you can see from the data, prop.success has some values outside the range 0.15 to 0.85, so maybe the regular linear model is not appropriate?

Thank you,

John

From: Robert A LaBudde r...@lcfltd.com
To: array chip arrayprof...@yahoo.com
Cc: Bert Gunter gunter.ber...@gene.com; r-h...@stat.math.ethz.ch
Sent: Thu, February 10, 2011 12:54:44 PM
Subject: Re: [R] comparing proportions
[earlier reply quoted above - snipped]
Re: [R] goodness-of-fit test
Skewed as they are, your data certainly don't look normal. Try lognormal.

The chi-square test gives good results when all counts are 5 or more, hence the warning.

At 12:25 AM 11/12/2010, Andrew Halford wrote:

Hi All,

I have a dataset consisting of abundance counts of a fish and I want to test if my data are Poisson in distribution or normal.

My first question is whether it is more appropriate to model my data according to a Poisson distribution (if my test says it conforms) or to use transformed data to normalise the data distribution.

I have been using the vcd package:

gf <- goodfit(Y, type = "poisson", method = "MinChisq")

but I get the following error message:

Warning message:
In optimize(chi2, range(count)) : NA/Inf replaced by maximum positive value

I then binned my count data to see if that might help

   V1 V2
1   5 34
2  10 30
3  15 10
4  20  8
5  25  7
6  30  0
7  35  3
8  40  2
9  45  3
10 50  1
11 55  0
12 60  1

but still received an error message:

Goodness-of-fit test for poisson distribution
        X^2 df P(> X^2)
Pearson 2573372 330
Warning message:
In summary.goodfit(gf) : Chi-squared approximation may be incorrect

Am I getting caught out because of zero counts or frequencies in my data?

Andy
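A sketch of the lognormal suggestion above, assuming 'counts' holds the raw, strictly positive abundance counts (zeros would have to be handled separately, since log(0) is undefined):

library(MASS)
fit <- fitdistr(counts, "lognormal")
fit$estimate
ks.test(counts, "plnorm", meanlog = fit$estimate["meanlog"],
        sdlog = fit$estimate["sdlog"])   # approximate: parameters estimated, counts have ties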
Re: [R] How to do bootstrap for the complex sample design?
At 01:38 AM 11/4/2010, Fei xu wrote:

Hello;

Our survey is structured as follows: the area to be investigated is divided into 6 regions; within each region, one urban community and one rural community are randomly selected; then samples are randomly drawn from each selected urban and rural community. The problem is that within each urban/rural stratum we have only one sample. In this case, how do we do the bootstrap?

Any comments or hints are greatly appreciated!

Faye

Just make a table of your data, with each row corresponding to a measurement. Your columns will be Region, UrbanCommunity, RuralCommunity and your response variables. Bootstrap resampling is just generating random row indices into this table, with replacement, i.e.

index <- sample(1:N, N, replace=TRUE)

Then your resample is myTable[index,].

Because you chose UrbanCommunity and RuralCommunity randomly, this shouldn't be a problem. The fact that you chose a subsample size of 1 means you won't be able to estimate within-region variances unless you make some serious assumptions (e.g., UrbanCommunity effect independent of Region effect).
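A sketch of the row-resampling idea just described; 'myTable' and 'N' are as above, and 'response' stands for whichever outcome column is of interest:

N <- nrow(myTable)
B <- 2000
boot.stats <- replicate(B, {
  index <- sample(1:N, N, replace = TRUE)   # resample rows with replacement
  mean(myTable[index, "response"])
})
quantile(boot.stats, c(0.025, 0.975))        # simple percentile interval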
Re: [R] When to use bootstrap confidence intervals?
1. The bootstrap will do poorly for small sample sizes (less than 25 or so). Parametric methods (t) have the advantage of working down to a sample size of less than 5.

2. You need the number of resamples to be reasonably large, say 10,000, for poorly behaved distributions. Otherwise the extreme percentiles of the resampling distribution used in the calculations are too inaccurate.

3. Your examples are pathological. For the normal case, the t C.I. should be optimal, so no surprise there. For the Student t(df=3) case, you have a near-singular case with only a couple of moments that exist. Try something skewed (lognormal) or bimodal (a mixture) as a better example.

4. BCa generally gives the best results, but only when the sample size is moderately large (> 25) and the number of resamples is large (several thousand).

At 07:09 AM 8/16/2010, Mark Seeto wrote:

Hello,

I have a question regarding bootstrap confidence intervals. Suppose we have a data set consisting of single measurements, and that the measurements are independent but the distribution is unknown. If we want a confidence interval for the population mean, when should a bootstrap confidence interval be preferred over the elementary t interval?

I was hoping the answer would be "always", but some simple simulations suggest that this is incorrect. I simulated some data and calculated 95% elementary t intervals and 95% bootstrap BCa intervals (with the boot package). I calculated the proportion of confidence intervals lying entirely above the true mean, the proportion entirely below the true mean, and the proportion containing the true mean. I used a normal distribution and a t distribution with 3 df.

library(boot)
samplemean <- function(x, ind) mean(x[ind])

ci.norm <- function(sample.size, n.samples, mu = 0, sigma = 1, boot.reps) {
  t.under <- 0; t.over <- 0
  bca.under <- 0; bca.over <- 0
  for (k in 1:n.samples) {
    x <- rnorm(sample.size, mu, sigma)
    b <- boot(x, samplemean, R = boot.reps)
    bci <- boot.ci(b, type = "bca")
    if (mu < mean(x) - qt(0.975, sample.size - 1)*sd(x)/sqrt(sample.size)) t.under <- t.under + 1
    if (mu > mean(x) + qt(0.975, sample.size - 1)*sd(x)/sqrt(sample.size)) t.over <- t.over + 1
    if (mu < bci$bca[4]) bca.under <- bca.under + 1
    if (mu > bci$bca[5]) bca.over <- bca.over + 1
  }
  return(list(t = c(t.under, t.over, n.samples - (t.under + t.over))/n.samples,
              bca = c(bca.under, bca.over, n.samples - (bca.under + bca.over))/n.samples))
}

ci.t <- function(sample.size, n.samples, df, boot.reps) {
  t.under <- 0; t.over <- 0
  bca.under <- 0; bca.over <- 0
  for (k in 1:n.samples) {
    x <- rt(sample.size, df)
    b <- boot(x, samplemean, R = boot.reps)
    bci <- boot.ci(b, type = "bca")
    if (0 < mean(x) - qt(0.975, sample.size - 1)*sd(x)/sqrt(sample.size)) t.under <- t.under + 1
    if (0 > mean(x) + qt(0.975, sample.size - 1)*sd(x)/sqrt(sample.size)) t.over <- t.over + 1
    if (0 < bci$bca[4]) bca.under <- bca.under + 1
    if (0 > bci$bca[5]) bca.over <- bca.over + 1
  }
  return(list(t = c(t.under, t.over, n.samples - (t.under + t.over))/n.samples,
              bca = c(bca.under, bca.over, n.samples - (bca.under + bca.over))/n.samples))
}

> set.seed(1)
> ci.norm(sample.size = 10, n.samples = 1000, boot.reps = 1000)
$t
[1] 0.019 0.026 0.955
$bca
[1] 0.049 0.059 0.892

> ci.norm(sample.size = 20, n.samples = 1000, boot.reps = 1000)
$t
[1] 0.030 0.024 0.946
$bca
[1] 0.035 0.037 0.928

> ci.t(sample.size = 10, n.samples = 1000, df = 3, boot.reps = 1000)
$t
[1] 0.018 0.022 0.960
$bca
[1] 0.055 0.076 0.869
Warning message:
In norm.inter(t, adj.alpha) : extreme order statistics used as endpoints

> ci.t(sample.size = 20, n.samples = 1000, df = 3, boot.reps = 1000)
$t
[1] 0.027 0.014 0.959
$bca
[1] 0.054 0.047 0.899

I don't understand the warning message, but for these examples the ordinary t interval appears to be better than the bootstrap BCa interval. I would really appreciate any recommendations anyone can give on when bootstrap confidence intervals should be used.

Thanks,
Mark
Re: [R] Function to compute the multinomial beta function?
At 05:10 PM 7/5/2010, Gregory Gentlemen wrote: Dear R-users, Is there an R function to compute the multinomial beta function? That is, the normalizing constant that arises in a Dirichlet distribution. For example, with three parameters the beta function is Beta(n1,n2,n3) = Gamma(n1)*Gamma(n2)*Gamma(n3)/Gamma(n1+n2+n3)

beta3 <- function (n1, n2, n3) exp(lgamma(n1)+lgamma(n2)+lgamma(n3)-lgamma(n1+n2+n3))
beta3(5,3,8)
[1] 1.850002e-07

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
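The same trick generalizes to any number of Dirichlet parameters; a hedged one-liner (the name lmbeta is made up here) that works on the log scale and reproduces the value above:

lmbeta <- function(a) sum(lgamma(a)) - lgamma(sum(a))   # log of the multinomial Beta function
exp(lmbeta(c(5, 3, 8)))                                  # 1.850002e-07, agreeing with beta3(5,3,8)

Keeping the result on the log scale (lmbeta itself) avoids underflow or overflow when the parameters are large.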
Re: [R] Comparing distributions
Your * curve apparently dominates your + curve. If they have the same total number of data each, as you say, they both cannot sum to the same value (e.g., N = 1 or 1.000). So there is something going on that you aren't mentioning. Try comparing CDFs instead of pdfs. At 03:33 PM 6/23/2010, Ralf B wrote: I am trying to do something in R and would appreciate a push in the right direction. I hope some of you experts can help. I have two distributions obtained from 1 datapoints each (about 1 datapoints each, non-normal with multi-modal shape (when eye-balling densities) but other than that I know little about its distribution). When plotting the two distributions together I can see that the two densities are alike with a certain distance to each other (e.g. 50 units on the X axis). I tried to plot a simplified picture of the density plot below: [simplified ASCII sketch omitted: two similar density curves, '*' and '+', with the '+' curve offset horizontally from the '*' curve] What I would like to do is to formally test their similarity or otherwise measure it more reliably than just showing and discussing a plot. Is there a general approach other than using a Mann-Whitney test, which is very strict and seems to assume a perfect match? Is there a test that takes in a certain 'band' (e.g. 50, 100, 150 units on X), or are there any other similarity measures that could give me a statistic about how close these two distributions are to each other? All I can say from eye-balling is that they seem to follow each other and it appears that one distribution is shifted by an amount from the other. Any ideas? Ralf __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
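A hedged sketch of the CDF comparison suggested above, assuming the two samples sit in numeric vectors a and b (names hypothetical):

plot(ecdf(a), main = "Empirical CDFs")
plot(ecdf(b), add = TRUE, col = "red")   # overlay; a roughly constant horizontal offset suggests a pure shift
ks.test(a, b)                            # Kolmogorov-Smirnov distance and test
wilcox.test(a, b, conf.int = TRUE)       # Hodges-Lehmann estimate of the shift, with a confidence interval

With large samples almost any difference will be flagged as significant, so the estimated shift and its confidence interval are usually more informative than the p-value itself.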
[R] Singularity in simple ANCOVA problem
I'm using R 2.10.1 with the latest version of all packages (updated today). I'm confused as to why I'm getting a hard singularity in a simple set of experimental data:

blots
   ID Lot Age Conc
1   1   A   3 4.44
2   2   A   3 4.56
3   3   B  41 4.03
4   4   B  41 4.57
5   5   C 229 4.49
6   6   C 229 4.66
7   7   D 238 3.88
8   8   D 238 3.93
9   9   E 349 4.43
10 10   E 349 4.22
11 11   F 391 4.42
12 12   F 391 4.46

fit2 <- lm(Conc ~ Age + Lot, data=blots)
summary(fit2)

Call:
lm(formula = Conc ~ Age + Lot, data = blots)

Residuals:
       Min         1Q     Median         3Q        Max
-2.700e-01 -6.625e-02  6.072e-17  6.625e-02  2.700e-01

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.5004639  0.1273234  35.347 3.42e-08 ***
Age         -0.0001546  0.0004605  -0.336   0.7485
LotB        -0.1941237  0.1706005  -1.138   0.2986
LotC         0.1099485  0.1554378   0.707   0.5059
LotD        -0.5586598  0.1558853  -3.584   0.0116 *
LotE        -0.1214948  0.1698331  -0.715   0.5013
LotF                NA         NA      NA       NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1787 on 6 degrees of freedom
Multiple R-squared: 0.7464, Adjusted R-squared: 0.535
F-statistic: 3.532 on 5 and 6 DF, p-value: 0.07811

Why the NA's here? Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Singularity in simple ANCOVA problem
At 08:44 PM 6/20/2010, RICHARD M. HEIBERGER wrote: You have only one Age value in each Lot. Hence the Age is aliased with one of the 5 degrees of freedom between lots. You can see it in the picture xyplot(Conc ~ Age|Lot, data=blots, pch=16) There is no opportunity for a slope within each of the groups. Rich Doh! Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
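A quick way to see the aliasing Rich describes, as a hedged sketch using the blots data above: Age takes exactly one value within each Lot, so the Age column of the design matrix is a linear combination of the intercept and the Lot dummies, and lm() must drop one coefficient.

with(blots, tapply(Age, Lot, unique))              # a single Age value per Lot
qr(model.matrix(~ Age + Lot, data = blots))$rank   # rank 6 for a 7-column design matrix: one column is redundant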
Re: [R] Popularity of R, SAS, SPSS, Stata...
At 09:01 PM 6/20/2010, Ted Harding wrote: On 20-Jun-10 19:49:43, Hadley Wickham wrote: Whales are a different kettle of fish! They are much more directly observable, in principle, than are R-users. For one thing, a whale has to come to the surface to breathe every so often, and if you are in a ship nearby you can see it happen. <snip> One thing both whales and R users have in common is that, when you sight one, you say "R, Matey! Thar she blows!" Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] logistic regression with 50 varaibales
degli Studi di Trieste Via Alfonso Valerio 6/a I-34127 Trieste phone: +39 0 40 5 58-37 68 email: cbelei...@units.it __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] glm poisson function
prop.test(c(271,433), c(6164113,5572041))

        2-sample test for equality of proportions with continuity correction

data: c(271, 433) out of c(6164113, 5572041)
X-squared = 54.999, df = 1, p-value = 1.206e-13
alternative hypothesis: two.sided
95 percent confidence interval:
 -4.291428e-05 -2.457623e-05
sample estimates:
      prop 1       prop 2
4.396415e-05 7.770941e-05

At 06:36 AM 6/10/2010, Phender79 wrote: Hi, I'm totally new to R so I apologise for the basic request. I am looking at the incidence of a disease over two time periods 1990-1995 and 2003-2008. I have counts for each year, subdivided into three disease categories and by males/females. I understand that I need to analyse the data using poisson regression and have managed to use the pois.daly function to get age-sex adjusted rates and corresponding confidence intervals. However, I now want to know how to get a p value (I'm writing up a paper for a journal) to say that x number of cases in the first cohort (1990-1995) is significantly lower than y number in the later cohort (2003-2008). I also want to make sure that I've corrected for overdispersion etc. I'm totally stuck and can't think where to start with writing a script. So basically my question is: e.g. I have 271 cases of the disease between 1990-1995 (total population at risk over six years = 6,164,113) and 433 cases between 2003-2008 (total population at risk over six years = 5,572,041) - is this significant and what is the P value. Any help much appreciated! Cheers P -- View this message in context: http://r.789695.n4.nabble.com/glm-poisson-function-tp2250210p2250210.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
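Since the original question asked about Poisson regression, here is a hedged sketch of that route with the two aggregated counts (object names are made up); the population at risk enters as an offset. With only two counts and two parameters the model is saturated, so overdispersion cannot be assessed from these data; that would need the year-by-year (or age-by-sex) counts.

cases  <- c(271, 433)
pop    <- c(6164113, 5572041)
period <- factor(c("1990-1995", "2003-2008"))
fit <- glm(cases ~ period + offset(log(pop)), family = poisson)
summary(fit)        # the Wald test on the period coefficient tests equality of the two incidence rates
exp(coef(fit)[2])   # estimated rate ratio, 2003-2008 vs 1990-1995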
Re: [R] When normality test violate and sample size is large(n=400) can I use t-test
At 11:45 AM 6/1/2010, Kourosh Ks wrote: Dears, when the normality test is violated and the sample size is large (n=400), can I use a t-test? best regards, kourosh

Generally yes, unless there is something really pathological about the distribution. You should note that, for n = 400, even the simplest distribution-free tests should have near-perfect power, so which test you use is not important. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
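A small simulation sketch (an aside, not from the thread) of the "generally yes": checking whether the one-sample t-test roughly holds its nominal level at n = 400 when the data are strongly skewed (standard lognormal, true mean exp(0.5)):

set.seed(1)
pvals <- replicate(5000, t.test(rlnorm(400), mu = exp(0.5))$p.value)
mean(pvals < 0.05)   # close to the nominal 0.05 despite the marked skewness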
Re: [R] Histogram Bin
x <- rnorm(200)
hist(x, 18)
str(hist(x, 18))
List of 7
 $ breaks     : num [1:15] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...
 $ counts     : int [1:14] 3 1 8 12 34 35 40 30 18 11 ...
 $ intensities: num [1:14] 0.03 0.01 0.08 0.12 0.34 ...
 $ density    : num [1:14] 0.03 0.01 0.08 0.12 0.34 ...
 $ mids       : num [1:14] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...
 $ xname      : chr "x"
 $ equidist   : logi TRUE
 - attr(*, "class")= chr "histogram"
hist(x, 18, plot=FALSE)$breaks
 [1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

At 09:55 AM 5/14/2010, Research wrote: Hello, Is there a function that returns the number of the bin (or quantile, or percentile etc. etc.) that a value of a variable may belong to? For example: breaks <- hist(variable, 18, plot=FALSE) If the following breaks 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 are the boundaries of successive bins of a histogram, then value 6 belongs to the 2nd bin. Best regards, Costas __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
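For the bin-number question itself, base R has this built in; a hedged sketch in the notation of the original post (variable is whatever vector was histogrammed):

br <- hist(variable, 18, plot = FALSE)$breaks
findInterval(variable, br)                                 # index of the bin each value falls in
cut(variable, br, labels = FALSE, include.lowest = TRUE)   # same idea via cut()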
Re: [R] Cubic B-spline, how to numerically integrate?
It's a piecewise cubic function. All you need are the knots and the coefficients. You can then write down the analytic formula for the integral. It doesn't need to be numerically integrated. At 11:57 AM 5/14/2010, David Winsemius wrote: On May 14, 2010, at 11:41 AM, Claudia Penaloza wrote: (corrected version of previous posting) I fit a GAM to turtle growth data following methods in Limpus & Chaloupka 1997 (http://www.int-res.com/articles/meps/149/m149p023.pdf). I want to obtain figures similar to Fig 3 c & f in Limpus & Chaloupka (1997), which according to the figure legend are expected size-at-age functions obtained by numerically integrating size-specific growth rate functions derived using a cubic B-spline fit to GAM predicted values. I was able to fit the cubic B-spline, but I do not know how to numerically integrate it. You need to give us the function and the appropriate limits of integration. Can anybody help please? Code and figures here: https://docs.google.com/fileview?id=0B1cQ7z9xYFl2ZTZhMmMyMjAtYTA3Zi00N2QyLTkxNzMtOGYyMjdiOGU2ZWE4hl=en Thank you, Claudia Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
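If one does want R to do the integration anyway, a hedged sketch: build an interpolating cubic spline of the GAM-predicted growth rates and integrate it numerically (age.grid, predicted.rate, a0 and size0 are hypothetical placeholders for the ages, predicted growth rates, starting age and starting size):

growth.rate <- splinefun(age.grid, predicted.rate, method = "natural")   # piecewise cubic through the predictions
size.at.age <- function(a, a0, size0) size0 + integrate(growth.rate, lower = a0, upper = a)$value
# e.g. sapply(seq(a0, max(age.grid), length = 50), size.at.age, a0 = a0, size0 = size0)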
Re: [R] P values
At 07:10 AM 5/7/2010, Duncan Murdoch wrote: Robert A LaBudde wrote: At 01:40 PM 5/6/2010, Joris Meys wrote: On Thu, May 6, 2010 at 6:09 PM, Greg Snow greg.s...@imail.org wrote: Because if you use the sample standard deviation then it is a t test not a z test. I'm doubting that seriously... You calculate normalized Z-values by substracting the sample mean and dividing by the sample sd. So Thomas is correct. It becomes a Z-test since you compare these normalized Z-values with the Z distribution, instead of the (more appropriate) T-distribution. The T-distribution is essentially a Z-distribution that is corrected for the finite sample size. In Asymptopia, the Z and T distribution are identical. And it is only in Utopia that any P-value less than 0.01 actually corresponds to reality. I'm not sure what you mean by this. P-values are simply statistics calculated from the data; why wouldn't they be real if they are small? Do you truly believe an actual real-life distribution accurately is fit by a normal distribution at quantiles of 0.001, 0.0001 or beyond? The map is not the territory, and just because you can calculate something from a model doesn't mean it's true. The real world is composed of mixture distributions, not pure ones. The P-value may be real, but its reality is subordinate to the distributional assumption involved, which always fails at some level. I'm simply asserting that level is in the tails at probabilities of 0.01 or less. Statisticians, even eminent ones such as yourself and lesser lights such as myself, frequently fail to keep this in mind. We accept such assumptions as normality, equal variances, etc., on an eyeballometric basis, without any quantitative understanding of what this means about limitations on inference, including P-values. Inference in statistics is much cruder and more judgmental than we like to portray. We should at least be honest among ourselves about the degree to which our hand-waving assumptions work. I remember at the O. J. Simpson trial, the DNA expert asserted that a match would occur only once in 7 billion people. I wondered at the time how you could evaluate such an assertion, given there were less than 7 billion people on earth at the time. When I was at a conference on optical disk memories when they were being developed, I heard a talk about validating disk specifications against production. One statement was that the company would also validate the undetectable error rate specification of 1 in 10^16 bits. I amusingly asked how they planned to validate the undetectable error rate. The response was handwaving and Just as we do everything else. The audience laughed, and the speaker didn't seem to know what the joke was. In both these cases the values were calculable, but that didn't mean that they applied to reality. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] P values
At 01:40 PM 5/6/2010, Joris Meys wrote: On Thu, May 6, 2010 at 6:09 PM, Greg Snow greg.s...@imail.org wrote: Because if you use the sample standard deviation then it is a t test not a z test. I'm doubting that seriously... You calculate normalized Z-values by substracting the sample mean and dividing by the sample sd. So Thomas is correct. It becomes a Z-test since you compare these normalized Z-values with the Z distribution, instead of the (more appropriate) T-distribution. The T-distribution is essentially a Z-distribution that is corrected for the finite sample size. In Asymptopia, the Z and T distribution are identical. And it is only in Utopia that any P-value less than 0.01 actually corresponds to reality. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Random numbers with PDF of user-defined function
At 05:40 AM 4/29/2010, Nick Crosbie wrote: Hi, In S+/R, is there an easy way to generate random numbers with a probability distribution specified by an exact user-defined function? For example, I have a function: f(x) = 1/(365 * x), which should be fitted for values of x between 1 and 100,000 How do I generate random numbers with a probability distribution that exactly maps the above function? Nick

First of all, your pdf should be f(x) = 1 / [x ln(10^5)], if x is continuous, so that it integrates to 1 over [1, 100000]. Second, compute the cdf as F(x) = ln(x) / ln(10^5). Third, compute the inverse cdf as G(p) = exp[p ln(10^5)]. Finally, to generate random variates, use G(u), where u is a uniform random variate in [0,1]. In R,

G <- function (p) exp(p*log(10^5))
G(runif(5))
[1] 11178.779736  9475.748549 65939.487801 94914.354479     1.694695

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
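A quick check of the generator (an aside, with G as defined above): if X = G(U) then log10(X) = 5U, so the base-10 logs of the draws should look uniform on [0, 5]:

set.seed(1)
x <- G(runif(1e5))
hist(log10(x), breaks = 50, freq = FALSE)   # roughly flat at 1/5 = 0.2 over [0, 5]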
Re: [R] From THE R BOOK - Warning: In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!
At 12:08 PM 3/30/2010, David Winsemius wrote: snip I don't understand this perspective. You bought Crowley's book so he is in some minor sense in debt to you. Why should you think it is more appropriate to send your message out to thousands of readers of r- help around the world (some of whom have written books that you did not buy) before sending Crowley a question about his text? In fairness to Michael Crawley, whose books are useful and very clear (although not well-liked on this list for some reason): 1. The example quoted by Corrado Topi is not an actual example. Instead is an isolated line of code given to illustrate the simplicity of glm() syntax and its relation to lm() syntax. This is in a short general topic overview chapter on GLMs meant to introduce concepts and terminology, not runnable code. 2. The example chapter is followed in the book by individual chapters on each type of GLM covered (count data, count data in tables, proportion data, binary response variables). If Corrado Topi had looked in the relevant chapter, he would find numerous worked out examples with runnable code. Corrado Topi made an error in trying to run an isolated line of code without antecedent definitions, which almost never works in any programming system. Michael Crawley made a mistake in judgment in assuming that detail later will suffice for generality now. My advice to Corrado Topi is engage in some forward referencing, and read chapters 16 and 17 before deciding which example code to run. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Nonparametric generalization of ANOVA
A search on bluesky...@gmail.com shows the user is in Norfolk, VA, USA. At 01:26 PM 3/5/2010, John Sorkin wrote: The sad part of this interchanges is that Blue Sky does not seem to be amiable to suggestion. He, or she, has not taken note, or responded to the fact that a number of people believe it is good manners to give a real name and affiliation. My mother taught me that when two people tell you that you are drunk you should lie down until the inebriation goes away. Blue Sky, several people have noted that you would do well to give us your name and affiliation. Is this too much to ask given that people are good enough to help you? John John David Sorkin M.D., Ph.D. Chief, Biostatistics and Informatics University of Maryland School of Medicine Division of Gerontology Baltimore VA Medical Center 10 North Greene Street GRECC (BT/18/GR) Baltimore, MD 21201-1524 (Phone) 410-605-7119 (Fax) 410-605-7913 (Please call phone number above prior to faxing) Matthew Dowle mdo...@mdowle.plus.com 3/5/2010 12:58 PM Frank, I respect your views but I agree with Gabor. The posting guide does not support your views. It is not any of our views that are important but we are following the posting guide. It covers affiliation. It says only that some consider it good manners to include a concise signature specifying affiliation. It does not agree that it is bad manners not to. It is therefore going too far to urge R-gurus, whoever they might be, to ignore such postings on that basis alone. It is up to responders (I think that is the better word which is the one used by the posting guide) whether they reply. Missing affiliation is ok by the posting guide. Users shouldn't be put off from posting because of that alone. Sending from an anonymous email address such as BioStudent is also fine by the posting guide as far as my eyes read it. It says only that the email address should work. I would also answer such anonymous posts, providing they demonstrate they made best efforts to follow the posting guide, as usual for all requests for help. Its so easy to send from a false, but apparently real name, why worry about that? If you disagree with the posting guide then you could make a suggestion to get the posting guide changed with respect to these points. But, currently, good and practice is defined by the posting guide, and I can't see that your view is backed up by it. In fact it seems to me that these points were carefully considered, and the wording is careful on these points. As far as I know you are wrong that there is no moderator. There are in fact an uncountable number of people who are empowered to moderate i.e. all of us. In other words its up to the responders to moderate. The posting guide is our guide. As a last resort we can alert the list administrator (which I believe is the correct name for him in that role), who has powers to remove an email address from the list if he thinks that is appropriate, or act otherwise, or not at all. It is actually up to responders (i.e. all of us) to ensure the posting guide is followed. My view is that the problems started with some responders on some occasions. They sometimes forgot, a little bit, to encourage and remind posters to follow the posting guide when it was not followed. This then may have encouraged more posters to think it was ok not to follow the posting guide. That is my own personal view, not a statistical one backed up by any evidence. Matthew Frank E Harrell Jr f.harr...@vanderbilt.edu wrote in message news:4b913880.9020...@vanderbilt.edu... 
Gabor Grothendieck wrote: I am happy to answer posts to r-help regardless of the name and email address of the poster but would draw the line at someone excessively posting without a reasonable effort to find the answer first or using it for homework since such requests could flood the list making it useless for everyone. Gabor I respectfully disagree. It is bad practice to allow anonymous postings. We need to see real names and real affiliations. r-help is starting to border on uselessness because of the age old problem of the same question being asked every two days, a high frequency of specialty questions, and answers given with the best of intentions in incremental or contradictory e-mail pieces (as opposed to a cumulative wiki or hierarchically designed discussion web forum), as there is no moderator for the list. We don't need even more traffic from anonymous postings. Frank On Fri, Mar 5, 2010 at 10:55 AM, Ravi Varadhan rvarad...@jhmi.edu wrote: David, I agree with your sentiments. I also think that it is bad posting etiquette not to sign one's genuine name and affiliation when asking for help, which blue sky seems to do a lot. Bert Gunter has already raised this issue, and I completely agree with him. I would also like to urge the R-gurus to ignore such postings. Best, Ravi.
Re: [R] Help with simple bootstrap test
The boot() function in the 'boot' package expects to find a function for the statistic with two arguments: the data object plus a row index object. You don't indicate enough to see how you will be resampling. If you sum all elements in your table, resampling would have to be one of: 1. A sample with replacement of rows, or 2. A sample with replacement of columns, or 3. A sample with replacement of elements in the whole table. Assuming you want sampling with replacement of rows:

sumf <- function (x, i) { sum(x[i,]) }
result <- boot(mytable, sumf, 1000)
boot.ci(result, type='bca')

A simulation:

mytable <- matrix(rnorm(2000), ncol=20)
sum(mytable)
[1] -14.92842
sumf <- function(x,i) sum(x[i,])
require('boot')
Loading required package: boot
b1 <- boot(mytable, sumf, 1000)
boot.ci(b1, type='bca')
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = b1, type = "bca")
Intervals :
Level      BCa
95%   (-101.49,  85.20 )
Calculations and Intervals on Original Scale

At 01:28 PM 2/25/2010, xonix wrote: Hi all Forgive me, I'm a total R newbie, and this seems to be a straightforward simple bootstrap problem, but after a whole day of trying to figure out how to do it I'm ready to give up. Part of the problem is that every example and every help page seems to be about doing something far more complex. I've got a table with 40 columns and 750 rows. I sum all the values across the whole table (and subsets of the columns). I want to bootstrap to get the 95% confidence intervals for that sum value. result <- boot(table, function, 1000) boot.ci(result, "bca") It seems to me that the 'function' is something to sum all the columns and rows of the table (or a subset should I desire). I've tried writing 'sum' for the function, but this gives me a huge figure which can't possibly be right. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] multiple-test correlation
At 07:35 AM 2/14/2010, Manuel Jesús López Rodríguez wrote: Dear all, I am trying to study the correlation between one independent variable (V1) and several others dependent among them (V2, V3, V4 and V5). For doing so, I would like to analyze my data by multiple-test (applying the Bonferroni correction or other similar), but I do not find the proper command in R. What I want to do is to calculate Kendall's correlation between V1 and the other variables (i.e. V1 vs V2, V1 vs V3, etc.) and to correct the p values by Bonferroni or other. I have found outlier.test, but I do not know if this is what I need (also, I would prefer to use a less conservative method than Bonferroni's, if possible). Thank you very much in advance!

One approach might be to first test for any correlations via a likelihood ratio test:

Ho: P = I (no correlations, i.e., the covariance matrix is diagonal)
T = -a ln V ~ chi-square[p(p-1)/2]
where V = det(R), a = N - 1 - (2p + 5)/6, N = # data, p = # variables.
Reject Ho if T > X^2(alpha, p(p-1)/2).

Then do the pairwise tests without familywise error control. I.e., this is similar to doing the F test in ANOVA before doing LSD testing. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
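A minimal sketch of that likelihood ratio test (a Bartlett-type test that the correlation matrix is the identity), assuming the variables sit in the columns of a data frame or matrix dat (name hypothetical); note it is built on the Pearson correlation matrix, while the follow-up pairwise tests can use cor.test(..., method = "kendall"):

lrt.cor <- function(dat) {
  N <- nrow(dat); p <- ncol(dat)
  R <- cor(dat)                          # Pearson correlation matrix
  a <- N - 1 - (2 * p + 5) / 6
  T <- -a * log(det(R))
  df <- p * (p - 1) / 2
  c(statistic = T, df = df, p.value = pchisq(T, df, lower.tail = FALSE))
}
# lrt.cor(dat)  # a small p-value says some correlation is nonzero; then run the pairwise cor.test()s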
Re: [R] random effects in mixed model not that 'random'
acting as grouping variable for a random effect intercept. If I have no knowledge about the schools, the random effect assumption makes sense. If I however investigate the schools in detail (either a priori or a posterior), say teaching quality of the teachers, socio-economic status of the school area etc, it will probably make sense to predict which ones will have pupils performing above average, and which below average. However then probably these factors leading me to the predictions should enter the model as fixed effects, and maybe I don't need and school random effect any more at all. But this means actually the school deviation from the global mean is not the realization of a random variable, but instead the result of something quite deterministic, but which is usually just unknown, or can only be measured with extreme, impractical efforts. So the process might not be random, just because so little is known about the process, the results appear as if they would be randomly drawn (from a larger population distribution). Again, is ignorance / lack of deeper knowledge the key to using random effects - and the more knowledge I have, the less ? many thanks, Thomas __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Implementation of gamma function for large number
At 04:53 PM 10/24/2009, R_help Help wrote: Hi - I ran into a problem when the argument to the gamma function is large. Say, if I do the following: gamma(17000) [1] Inf Warning message: value out of range in 'gammafn' Is there any way to get around this or another implementation? Thank you. -rc

Try the log of the gamma function instead:

?gamma
lgamma(17000)
[1] 148592.5
(17000-0.5)*log(17000)-17000+0.5*log(2*pi)
[1] 148592.5

Note that, for this size number, the first few terms of the usual expansions give the answer to 7 significant figures. I will restrain myself from asking the obvious question of what possible use gamma(17000) could be to you, and why a simple infinity would not work just as well. I'm sure you must have good reason for your request. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
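A hedged aside on how such log-scale values usually get used downstream: ratios and products of astronomically large gamma values are often perfectly ordinary numbers, and they can be formed safely by differencing lgamma() terms before exponentiating.

exp(lgamma(17000) - lgamma(16999))   # Gamma(17000)/Gamma(16999) = 16999, with no overflow
lchoose(17000, 8500)                 # log of an enormous binomial coefficient, same idea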
[R] Confusion on use of permTS() in 'perm'
Consider the following simple example using R-2.9.0 and 'perm' 2.9.1:

require('perm')
p <- c(15,21,26,32,39,45,52,60,70,82)
g <- c('y','n','y','y', rep('n',6))   # Patients ranked 1,3,4 receive treatment
permTS(p ~ g, alternative = 'two.sided', method='exact.ce')   # find p-value by complete enumeration

        Exact Permutation Test (complete enumeration)
data: p by g
p-value = 0.05
alternative hypothesis: true mean of g=n minus mean of g=y is not equal to 0
sample estimates:
mean of g=n minus mean of g=y
                     28.38095

The permutation observed is '134', which has a rank sum of 8. Other permutations with rank sums of 8 or less are '123', '124' and '125'. So there are a total of 4 out of 4! = 120 possible, or a one-tail p-value of 4/120 = 0.0333, or a 2-tail p-value of 2*4/120 = 0.067. This is not, however, what permTS() returns. The permTS() value of 0.05 appears to correspond to 3 patterns, not 4. Am I misunderstanding how to solve this simple problem, or is something going on with permTS() that I'm missing? Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
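One way to chase the discrepancy (an aside, not a reply from the thread) is to enumerate the statistic permTS() actually reports, the difference in group means, rather than rank sums; the sketch below assumes that is the statistic being permuted, as the printed output suggests. On the mean-difference scale the assignment '125' is slightly less extreme than the observed '134', so only 3 of the 120 assignments fall in the observed tail, and doubling that tail reproduces the reported 0.05.

p <- c(15, 21, 26, 32, 39, 45, 52, 60, 70, 82)
treat <- combn(10, 3)                                  # all 120 choices of the 3 treated ranks
diffs <- apply(treat, 2, function(ix) mean(p[-ix]) - mean(p[ix]))
obs <- mean(p[-c(1, 3, 4)]) - mean(p[c(1, 3, 4)])      # 28.38095, matching the permTS() output
sum(diffs >= obs)                                      # 3: assignments '123', '124', '134' only
2 * mean(diffs >= obs)                                 # 0.05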
Re: [R] Problem accessing functions in package 'roxygen'
Thanks, Uwe. Unfortunately, that doesn't work either:

library('roxygen')
Warning message:
package 'roxygen' was built under R version 2.9.1
roxygen::trim(' 1234 ')
Error: 'trim' is not an exported object from 'namespace:roxygen'

I ended up using

trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)

instead. At 01:42 PM 9/3/2009, Uwe Ligges wrote: Robert A. LaBudde wrote: I have Vista Home with R-2.9.0, and installed and tried to test the package 'roxygen': utils:::menuInstallPkgs() trying URL 'http://lib.stat.cmu.edu/R/CRAN/bin/windows/contrib/2.9/roxygen_0.1.zip' Content type 'application/zip' length 699474 bytes (683 Kb) opened URL downloaded 683 Kb package 'roxygen' successfully unpacked and MD5 sums checked The downloaded packages are in C:\Users\RAL\AppData\Local\Temp\RtmpZPlILq\downloaded_packages updating HTML package descriptions Warning message: In file.create(f.tg) : cannot create file 'C:\PROGRA~1\R\R-29~1.0/doc/html/packages.html', reason 'Permission denied' library('roxygen') Warning message: package 'roxygen' was built under R version 2.9.1 trim(" 1234 ") Error: could not find function "trim" I have a similar problem with trim.right(), trim.left() and other functions I've tried. Any ideas? Probably it is not intended to call trim and friends like that, because they are not exported from roxygen's namespace, hence you could use roxygen:::trim(" 1234 ") Uwe Ligges Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Problem accessing functions in package 'roxygen'
I have Vista Home with R-2.9.0, and installed and tried to test the package 'roxygen': utils:::menuInstallPkgs() trying URL 'http://lib.stat.cmu.edu/R/CRAN/bin/windows/contrib/2.9/roxygen_0.1.zip' Content type 'application/zip' length 699474 bytes (683 Kb) opened URL downloaded 683 Kb package 'roxygen' successfully unpacked and MD5 sums checked The downloaded packages are in C:\Users\RAL\AppData\Local\Temp\RtmpZPlILq\downloaded_packages updating HTML package descriptions Warning message: In file.create(f.tg) : cannot create file 'C:\PROGRA~1\R\R-29~1.0/doc/html/packages.html', reason 'Permission denied' library('roxygen') Warning message: package 'roxygen' was built under R version 2.9.1 trim( 1234) Error: could not find function trim I have a similar problem with trim.right(), trim.left() and other functions I've tried. Any ideas? Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] online classes or online eduction in statistics? esp. time series analysis and cointegration?
http://www.statistics.com/ for online courses without college credit, taught by well known faculty typically using their own books. At 01:50 AM 8/31/2009, Luna Laurent wrote: Hi all, I am looking for low cost online education in statistics. I am thinking of taking online classes on time series analysis and cointegration, etc. Of course, if there are free video lectures, that would be great. However I couldn't find any free video lectures at upper-undergraduate and graduate level which formally going through the whole timeseries education... That's why I would like to enroll in some sort of online degree classes. However, I don't want to earn the certificate or the degree; I just want to audit the online class specifically in time series analysis and cointegration. Could anybody recommend such online education in statistics esp. in time series and cointegration, at low cost? Hopefully it's not going to be like a few thousand dollars for one class. Thanks a lot for your pointers in advance! [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] transformation/link glm in R
1. It sounds as if you say left skewed when you mean right skewed and v.v. Right skewed distributions have a long tail to the right. 2. Your 3rd variable may be censored, as you indicate it is bounded at 8 hr and that is also the mode. If so, this will complicate your analysis. 3. Your variables would at first sight appear to be continuous, yet you are modeling them as counts. Why? Did you discretize the times? 4. I suggest you collect the standard transforms related to the different distributions, and do some EDA where you plot the distributions of the transformed data. How much does unimodal symmetry improve? This might point you towards the right family. Using the Box-Cox method to find a power transform may be useful in down-selecting. 5. You might consider the gamma distribution with a 1/x link for the first two variables. 6. A beta distribution could probably model all 3 of your variables. You have given insufficient information regarding your experiment and its response variables to allow accurate advice to be supplied. Where do the bounds come from? Are they arbitrary experimental cut-offs, causing censoring? If so, would you be better off using survival analysis (e.g., coxph() instead of glm)? Etc. At 06:52 AM 8/18/2009, Mcdonald, Grant wrote: Dear sir, I have 3 different time response variables that are in the form seconds. All three response variables are not normally distributed. They are in the form of mate latency; the first two responses are bounded to 30 mins and the third is bounded to 8 hours. Frequency plots of raw data show that the first two are heavily skewed to the left and the third (bounded at 8hrs) is heavily skewed to the right with most data points being 8 hours long. I am unsure of using the appropriate transformations in R and have only found appropriate transformations for count data (glm(time~x*x1*x2, family=poisson)) and proportion data (glm(time~x*x1*x2, family=binomial)). Would it be possible to point me in the right direction of the appropriate glm family to use for such data, or should I use some transformation separately and use anova in R, of which I am more confident of the code? Sorry if this is an inappropriate question but I would greatly appreciate advice, G. Colin __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
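Hedged sketches of two of the suggestions above, written in the poster's own notation (the data frame name dat is made up here):

# (5) Gamma GLM with the inverse link for a positive, right-skewed response time
fit.gamma <- glm(time ~ x * x1 * x2, family = Gamma(link = "inverse"), data = dat)
summary(fit.gamma)

# (4) Box-Cox search for a power transform before falling back on lm()/anova()
library(MASS)
bc <- boxcox(time ~ x * x1 * x2, data = dat)   # plots the profile log-likelihood against lambda
lambda <- bc$x[which.max(bc$y)]                # a rough choice of the power

Neither of these handles the possible censoring raised in point 2; if the 8-hour values are truly censored, the survival-analysis route mentioned at the end of the reply would be the more honest one.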
Re: [R] bootstrapped correlation confint lower than -1 ?
The basic interval (which you are using by default) is just the usual normal distribution interval with the standard error estimated from the resampling distribution. It has all the benefits and problems of assuming a normal distribution. If you wish to maintain the domain limits on the correlation, use the percentile method, which would use the 2.5% and 97.5% quantiles of the resampling distribution. Because you clearly have a non-central correlation (-0.8), you would get more accurate results with the BCa method, which does bias and skew correction of the resampling distribution. At 06:22 AM 8/16/2009, Liviu Andronic wrote: Dear R users, Do the results below make any sense? Can the interval of the correlation coefficient be between *-1.0185* and -0.8265 at the 95% confidence level? Liviu

library(boot)
data(mtcars)
with(mtcars, cor.test(mpg, wt, met="spearman"))

        Spearman's rank correlation rho
data: mpg and wt
S = 10292, p-value = 1.488e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho
-0.88642
Warning message:
In cor.test.default(mpg, wt, met = "spearman") : Cannot compute exact p-values with ties

corr.fun <- function(data, indices) {
  data.samp <- data[indices,]    # allows boot to select sample
  tmp.corr <- with(data.samp, cor.test(mpg, wt, met="spearman"))
  return(tmp.corr$estimate)
}
corr.boot <- boot(mtcars, corr.fun, R=1000)
There were 50 or more warnings (use warnings() to see the first 50)
corr.boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = mtcars, statistic = corr.fun, R = 1000)
Bootstrap Statistics :
     original      bias    std. error
t1*  -0.88642    0.012992    0.050042

boot.ci(corr.boot, type="basic")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = corr.boot, type = "basic")
Intervals :
Level      Basic
95%   (-1.0185, -0.8265 )
Calculations and Intervals on Original Scale

-- Do you know how to read? http://www.alienetworks.com/srtest.cfm Do you know how to write? http://garbl.home.comcast.net/~garbl/stylemanual/e.htm#e-mail __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
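A hedged follow-up in code, using the corr.boot object from the post above: asking boot.ci() for the percentile and BCa intervals keeps the endpoints inside [-1, 1] (BCa wants a reasonably large number of replicates, so R = 1000 or more):

boot.ci(corr.boot, type = c("perc", "bca"))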
Re: [R] How to test frequency independence (in a 2 by 2 table) with many missing values
1. Drop all subjects where both test results are not present. These are uninformative on the difference. 2. Perform the McNemar test on the remaining table. (Only the discordant pairs are informative.) This will give you a p-value on the data that actually contrasts the two tests. (For your example, the p-value is about 0.044:)

X2 <- data.frame(x1.m, x2.m)
tX2 <- table(X2)
tX2
    x2.m
x1.m  0  1
   0  6  5
   1 15  4
mcnemar.test(tX2)
        McNemar's Chi-squared test with continuity correction
data: tX2
McNemar's chi-squared = 4.05, df = 1, p-value = 0.04417

The low power is a consequence of the data not being informative. You'll have to live with it. If you want to squeeze any more out of this data, I'd guess about the only way to do it would be via a maximum likelihood approach with marginals for the one-test-only data, an assumption of, e.g., a binomial distribution for each test, and then use a likelihood ratio test for p1=p2 vs. p1!=p2. If you really feel up to it, you could program a randomization or bootstrap test equivalent to the ML approach, maintaining the marginal totals involved. At 01:23 PM 7/24/2009, Tal Galili wrote: Hello dear R help group. My question is statistical and not R specific, yet I hope some of you might be willing to help. *Experiment settings*: We have a list of subjects. Each of them went through two tests, where the answer to each can be either 0 or 1. *Goal*: We want to know if the two experiments yielded different results in the subjects' answers. *Statistical test (and why it won't work)*: Naturally we would turn to performing a McNemar test. But here is the twist: we have missing values in our data. For our purpose, let's assume the missingness is completely at random, and we also have no data to use for imputation. Also, we have much more missing data for experiment number 2 than in experiment number 1. So the question is, under these settings, how do we test for an experiment effect on the outcome? So far I have thought of two options: 1) To perform the test only for subjects that have both values. But they are too scarce and will yield low power. 2) To treat the data as independent and do a Pearson's chi square test (well, an exact Fisher test that is) on all the non-missing data that we have. The problem with this is that our data is not fully independent (which is a prerequisite to the chi test, if I understood it correctly). So I am not sure if that is a valid procedure or not. Any insights will be warmly welcomed. p.s: here is an example code producing this scenario.

set.seed(102)
x1 <- rbinom(100, 1, .5)
x2 <- rbinom(100, 1, .3)
X <- data.frame(x1,x2)
tX <- table(X)
margin.table(tX,1)
margin.table(tX,2)
mcnemar.test(tX)
put.missings <- function(x.vector, na.percent) {
  turn.na <- rbinom(length(x.vector), 1, na.percent)
  x.vector[turn.na == 1] <- NA
  return(x.vector)
}
x1.m <- put.missings(x1, .3)
x2.m <- put.missings(x2, .6)
tX.m <- rbind(table(na.omit(x1.m)), table(na.omit(x2.m)))
fisher.test(tX.m)

With regards, Tal -- -- My contact information: Tal Galili Phone number: 972-50-3373767 FaceBook: Tal Galili My Blogs: http://www.r-statistics.com/ http://www.talgalili.com http://www.biostatistics.co.il [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl.
ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Getting identify() to work with lattice::cloud()
I don't seem to be able to put any syntax into identify() that gets it to work with a lattice cloud() graph:

layout(1)
require('lattice')
cloud(g3 ~ g1 + g2, data=gapp, col = "blue", xlab='G1 Score', ylab='G2 Score', zlab='G3 Score', main='3D Plot for Applicants')
identify(gapp[,2:4], labels=gapp$ID)

Here g1 is gapp[,2], g2 is gapp[,3] and g3 is gapp[,4]. gapp is a data frame. I've tried several variants in the first arguments of identify(), but no matter what I use, when I click on a point I get the warning "no point within 0.25 inches". Does identify() work with cloud()? If so, what is the correct syntax? Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Comparing model fits for NLME when models are not nested
At 05:42 AM 6/12/2009, Lindsay Banin wrote: Hi there, I am looking to compare nonlinear mixed effects models that have different nonlinear functions (different types of growth curve)embedded. Most of the literature I can find focuses on comparing nested models with likelihood ratios and AIC. Is there a way to compare model fits when models are not nested, i.e. when the nonlinear functions are not the same? Transform back into original units, if necessary, and compare distributions of and statistics of residuals from fitted values in original units. This is not a significance-test, but instead a measure of the better approximation to the observed model. Types of measures: 1) rms residual, 2) max absolute residual, 3) mean absolute residual. In my opinion, models should be chosen based on the principles of causality (theory), degree of approximation and parsimony. None of these involve significance testing. Choosing models based upon significance testing (which merely identifies whether or not the experiment is large enough to distinguish an effect clearly) amounts to admitting intellectually that you have no subject matter expertise, and you must therefore fall back on the crumbs of significance testing to get glimmers of understanding of what's going on. (Much like stepwise regression techniques.) As an example, suppose you have two models, one with 5 parameters and one with only 1. The rms residual error for the two models are 0.50 and 0.53 respectively. You have a very large study, and all 4 additional parameters are significant at p = 0.01 or less. What should you do? What I would do is select the 1 parameter study as my baseline model. It will be easy to interpret physically, will generalize to other studies much better (stable), and is almost identical in degree of approximation as the 5 parameter model. I would be excited that a one parameter model could do this. The fact that the other 4 parameters have detectable effects at a very low level is not important for modeling the study, but may conceivably have some special significance on their own for future investigations. So not being able to do significance testing on non-nested models is not that big a loss, in my opinion. Such tests encourage wrong thinking, in my opinion. What I've expressed as an opinion here (which I am sure some will disagree with) is similar to the philosophy of choosing the number of principal components to use, or number of latent factors in factor analysis. What investigation do people ever do on the small eigenvalue principal components, even if their contributions are statistically significant? Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
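A minimal sketch of the residual summaries suggested above, assuming fit1 and fit2 are the two fitted (non-nested) models and y is the response, all in original units (names hypothetical):

res.stats <- function(y, fitted) {
  r <- y - fitted
  c(rms = sqrt(mean(r^2)), mean.abs = mean(abs(r)), max.abs = max(abs(r)))
}
rbind(model1 = res.stats(y, fitted(fit1)),
      model2 = res.stats(y, fitted(fit2)))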
Re: [R] time series, longitudinal data or trajectories
At 04:54 AM 6/6/2009, Christophe Genolini wrote: Thanks for your answers. So if I understand: - Trajectories are continuous, the others are discrete. - The difference between time series and longitudinal is that time series are made at regular times whereas longitudinal are not? - Repeated measures are over a short period of time. So if I measure the weight of my patient daily during one week, it will be a repeated measure; if I measure it once a week during one year, it will be a time series; if I measure it once a week during one year but with some missing weeks, it will be longitudinal data? Well I guess it is not as simple as that, but is it the idea? snip Not exactly. If you measure weight daily for a week, that is a time series (equally spaced time measurements over an arbitrary period) and repeated measurements (multiple measurements on the same subject, whether on time or at random or in some other way). If you measure weight weekly for each week in a year this is a time series (equally spaced time measurements) and would generally be called a longitudinal study (measurements over a lengthy enough time period that time-related changes are expected). Missing data are common in repeated measures or longitudinal studies. In longitudinal studies, an additional problem of dropouts is present, which may be correlated with the unobserved measurement (i.e., missing not at random, or non-ignorable). Also, the long time period of a longitudinal study may create issues of measurement bias (due to drift in technique or clinicians over time) and change in the subject baseline state. Time series is typically used in my experience for measurements that have a great degree of regularity (equally-spaced times, few or no missing data). Trajectory is a term for a continuous time curve. Examples: Study to measure blood pressure measurement fluctuations: N subjects measured by M operators every 8 hr during a week. Note there is a general expectation of a constant mean value for each subject during the period, with probably short-time fluctuations. This would be called a repeated measures study. Although it could also be called a time series study, the expectation of no total time period effect and the possibility of missing measurements would argue against that term. On the other hand, if a posteriori there were little or no missing data, and regular time-dependent patterns were observed, its name might be shifted to a time series study. Cohort study to measure blood pressure changes over a ten year time frame for treated and untreated subjects: There will be significant amounts of missing data, dropouts from the study and a long time period of observation. This would almost universally be known as a longitudinal study. Controlled experiment to measure rates of gelation of batches of different gelatins: N lots of gelatin, each measured in solution at the same M time periods for viscosity. A continuous underlying viscosity vs. time curve is expected (the trajectory) for each lot. Time periods are equal, and there are few missing data. The goal is to compare gelation trajectories. This is a repeated measures study, and might more particularly be characterized as a time series study. When the subjects are living entities, usually the terms repeated measures or longitudinal are used. If measurements are taken at a single point in time, the term cross-sectional study is used. If there is a single response across time, the term time series is used. 
If there are multiple responses all measured at the same times for the subjects, the term panel data is used. For controlled experiments, the terms repeated measures and time series are common. Longitudinal could be used, but generally is not. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] time series, longitudinal data or trajectories
At 04:02 PM 6/5/2009, Christophe Genolini wrote: Hi the list Strictly speaking, this is not a R question, but I need the information for the creation of a package. My question is about vocabulary: What is the difference between time series, longitudinal data and trajectories? Sincerely Christophe Longitudinal data are measurements over long periods of time, often at irregular periods, but consistent across subjects. Repeated measures data are replicates at the same point in time, or over a short period of time (e.g., laboratory experiments). Time series typically have constant increments of time and typically a stochastic character, although this term might be considered all-encompassing for all measurements at different times. Trajectory implies a continuous curve in time, as opposed to discrete times. Trajectory also implies an underlying causal model, as it is a term from kinematics. I hope this helps. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Block factor as random or fixed effect?
At 05:49 PM 5/13/2009, Rob Knell wrote: People I apologise for asking a general stats question, but I'm at a bit of a loss as to what to do following some hostile referees' comments. If I have a fully randomised blocked design, with only three blocks, should I treat block as a random or fixed effect? I have read comments about not treating block as a random effect if the number of blocks is less than 6 or 7: is this right? Any advice much appreciated Rob Knell If you treat the variable as fixed effects, then inference will only apply to those particular choices of blocks. If you treat the variable as a random effect, you are probably going to estimate a variance for a population distribution plus a mean effect, so inference can be made to the population of all possible blocks. The rule you've probably seen quoted could be paraphrased to say: If you're trying to estimate a random effect (i.e., variance), you will need at least 6 subjects, or you won't get any precision on the estimate. For fewer than 6 subjects, you might as well give up on modeling a random effect, and just settle for doing the fixed effects model. That being said, if you really need inferences on the population of blocks, model the random effect and bite the bullet on the imprecision. Also, remember the assumption that the blocks are chosen randomly (from a normal distribution). If they're not, stick with the fixed effects model. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
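A minimal sketch of the two specifications discussed above, with hypothetical variable names ('y', 'treatment', 'block' with only 3 levels, data frame 'dat'):

# Block as a fixed effect: inference applies only to these particular blocks.
fit.fixed <- lm(y ~ treatment + block, data = dat)
# Block as a random effect: inference extends to the population of blocks,
# but the block variance will be poorly estimated with only 3 levels.
library(lme4)
fit.random <- lmer(y ~ treatment + (1 | block), data = dat)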
Re: [R] p-values from bootstrap - what am I not understanding?
There is really nothing wrong with this approach, which differs primarily from the permutation test in that sampling is with replacement instead of without replacement (multinomial vs. multiple hypergeometric). One of the issues that permutation tests don't have is bias in the statistic. In order for bootstrap p-values to be reasonably accurate, you need a reasonable dataset size, so that sampling with replacement isn't a big effect, and so that enough patterns arise in resampling. It also helps if the data are continuous instead of categorical or binary. The same issues affect permutation tests, although those are untroubled by bias. The usual methods for p-values (e.g., see Fisher's test in Agresti's Categorical Data Analysis) work here. Typically there is some ambiguity on how to treat the values equal to the observed statistic. If you include them, the p-value is conservative for rejection. If you don't, it's liberal for rejection. If you include them with 1/2 weight, it averages correctly in the long run. Ditto for 2-tailed p-values vs. single tails. Several different methods (some of which you listed) are used. As a general rule, if you have data from which you wish a p-value, a permutation (i.e., without replacement) test is used, but for confidence intervals, bootstrapping (i.e., with replacement) is used. For reasonably large datasets, both methods will agree closely. But permutation tests are typically used for smaller datasets. (Think binomial vs. hypergeometric distributions for p-values, and when they agree.) At 05:47 PM 4/12/2009, Johan Jackson wrote: Dear stats experts: Me and my little brain must be missing something regarding bootstrapping. I understand how to get a 95%CI and how to hypothesis test using bootstrapping (e.g., reject or not the null). However, I'd also like to get a p-value from it, and to me this seems simple, but it seems no-one does what I would like to do to get a p-value, which suggests I'm not understanding something. Rather, it seems that when people want a p-value using resampling methods, they immediately jump to permutation testing (e.g., destroying dependencies so as to create a null distribution). SO - here's my thought on getting a p-value by bootstrapping. Could someone tell me what is wrong with my approach? Thanks: STEPS TO GETTING P-VALUES FROM BOOTSTRAPPING - PROBABLY WRONG: 1) sample B times with replacement, figure out theta* (your statistic of interest). B is large (> 1000) 2) get the distribution of theta* 3) the mean of theta* is generally near your observed theta. In the same way that we use non-centrality parameters in other situations, move the distribution of theta* such that the distribution is centered around the value corresponding to your null hypothesis (e.g., make the distribution have a mean theta = 0) 4) Two methods for finding 2-tailed p-values (assuming here that your observed theta is above the null value): Method 1: find the percent of recentered theta*'s that are above your observed theta. p-value = 2 * this percent Method 2: find the percent of recentered theta*'s that are above the absolute value of your observed value. This is your p-value. So this seems simple. But I can't find people discussing this. So I'm thinking I'm wrong. Could someone explain where I've gone wrong? J Jackson [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
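As a sketch of the permutation-test route with the 1/2-weight tie convention mentioned above (x and y are hypothetical numeric samples; the statistic is the difference in means, and B Monte Carlo permutations are used rather than the full enumeration):

perm_pvalue <- function(x, y, B = 10000) {
  obs <- mean(x) - mean(y)                     # observed statistic
  z <- c(x, y)
  n <- length(x)
  stat <- replicate(B, {                       # statistic under random relabeling
    i <- sample(length(z), n)
    mean(z[i]) - mean(z[-i])
  })
  # two-sided p-value, counting exact ties with weight 1/2
  (sum(abs(stat) > abs(obs)) + 0.5 * sum(abs(stat) == abs(obs))) / B
}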
Re: [R] Physical Units in Calculations
At 08:00 PM 4/12/2009, Tom La Bone wrote: Back in 2005 I had been doing most of my work in Mathcad for over 10 years. For a number of reasons I decided to switch over to R. After much effort and help from you folks I am finally thinking in R rather than thinking in Mathcad and trying to translating to R. Anyway, the only task I still use Mathcad for is calculations that involve physical quantities and units. For example, in Mathcad I can add 1 kilometer to 1 mile and get the right answer in the units of length I choose. Likewise, if I try to add 1 kilometer to 1 kilogram I get properly chastised. Is there a way in R to assign quantities and units to numbers and have R keep track of them like Mathcad does? Yes, but it's a lot of work: Create objects with units as an attribute. Perhaps someone else can tell you if such a set of definitions already exists. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
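A rough, hand-rolled sketch of the "units as an attribute" idea (not an existing package API; only addition of identical units is handled here, so unit conversions as in Mathcad would additionally need a conversion table):

quantity <- function(x, unit) structure(x, unit = unit, class = "quantity")
"+.quantity" <- function(e1, e2) {
  if (!identical(attr(e1, "unit"), attr(e2, "unit")))
    stop("incompatible units: ", attr(e1, "unit"), " vs ", attr(e2, "unit"))
  quantity(unclass(e1) + unclass(e2), attr(e1, "unit"))
}
print.quantity <- function(x, ...) cat(unclass(x), attr(x, "unit"), "\n")
quantity(1, "km") + quantity(0.5, "km")   # 1.5 km
# quantity(1, "km") + quantity(1, "kg")   # would stop with an "incompatible units" error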
Re: [R] Genstat into R - Randomisation test
At 04:43 AM 4/9/2009, Tom Backer Johnsen wrote: Peter Dalgaard wrote: Mike Lawrence wrote: Looks like that code implements a non-exhaustive variant of the randomization test, sometimes called a permutation test. Isn't it the other way around? (Permutation tests can be exhaustive by looking at all permutations, if a randomization test did that, then it wouldn't be random.) Eugene Edgington wrote an early book (1980) on this subject called Randomization tests, published by Dekker. As far as I remember, he differentiates between Systematic permutation tests where one looks at all possible permutations. This is of course prohibitive for anything beyond trivially small samples. For larger samples he uses what he calls Random permutations, where a random sample of the possible permutations is used. Tom Peter Dalgaard wrote: Mike Lawrence wrote: Looks like that code implements a non-exhaustive variant of the randomization test, sometimes called a permutation test. Isn't it the other way around? (Permutation tests can be exhaustive by looking at all permutations, if a randomization test did that, then it wouldn't be random.) Edgington and Onghena make a similar distinction in their book, but I think such a distinction is without merit. Do we distinguish between exact definite integrals and approximate ones obtained by numerical integration, of which Monte Carlo sampling is just one class of algorithms? Don't we just say: The integral was evaluated numerically by the [whatever] method to an accuracy of [whatever], and the value was found to be [whatever]. Ditto for optimization problems. A randomization test has one correct answer based upon theory. We are simply trying to calculate that answer's value when it is difficult to do so. Any approximate method that is used must be performed such that the error of approximation is trivial with respect to the contemplated use. Doing Monte Carlo sampling to find an approximate answer to a randomization test, or to an optimization problem, or to a bootstrap distribution should be carried out with enough realizations to make sure the residual error is trivial. As Monte Carlo sampling is a random-sampling-based approximate method, the name does create confusion in terminology between randomization tests and bootstrapping. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
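A small sketch of the "enough realizations" point above: the Monte Carlo standard error of an estimated p-value is roughly sqrt(p*(1-p)/B), so B can be chosen to make that error trivial relative to the decision threshold (the numbers below are only illustrative):

p_target <- 0.05
B <- 10000
sqrt(p_target * (1 - p_target) / B)   # about 0.0022, small relative to 0.05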
Re: [R] course in ecological data analysis
At 05:27 AM 4/3/2009, Capelle, Jacob wrote: Dear all, For my PhD study I'm looking for relevant courses/workshops (short term) in ecological data anlysis with R in Europe. After 2 days searching I'm convinced that google is probably not the right medium to find this information. If anyone can help me I will be most grateful. Best regards - J. Capelle Try http://www.statistics.com/ Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] 'stretching' a binomial variable
At 06:49 AM 3/27/2009, imicola wrote: Hi, I'm carrying out some Bayesian analysis using a binomial response variable (proportion: 0 to 1), but most of my observations have a value of 0 and many have very small values (i.e. 0.001). I'm having trouble getting my MCMC algorithm to converge, so I have decided to try normalising my response variable to see if this helps. I want it to stay between 0 and 1 but to have a larger range of values, or just for them all to be slightly higher. Does anyone know the best way to achieve this? I could just add a value to each observation (say 10 to increase the proportion a bit, but ensuring it would still be between 0 and 1) - would that be ok? Or is there a better way to stretch the values up? Sorry - I know it's not really an R specific question, but I have never found a forum with as many stats literate people as this one :-) Cheers - any advice much appreciated! nicola Work with events instead of proportions, and use a Poisson model. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
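A minimal sketch of that suggestion (variable names are hypothetical): model the event counts directly with a Poisson GLM and an offset for the exposure, instead of modeling the proportion.

fit <- glm(events ~ x1 + x2 + offset(log(n_trials)),
           family = poisson, data = dat)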
Re: [R] Logistic regression problem
It would not be possible to answer your original question until you specify your goal. Is it to develop a model with external validity that will generalize to new data? (You are not likely to succeed, if you are starting with a boil the ocean approach with 44,000+ covariates and millions of records.) This is the point Prof. Harrell is making. Or is it to reduce a large dataset to a tractable predictor formula that only interpolates your dataset? If the former, you will need external modeling information to select the wheat from the chaff in your excessive predictor set. Assuming it is the latter, then almost any approach that ends up with a tractable model (that has no meaning other than interpolation of this specific dataset) will be useful. For this, regression trees or even stepwise regression would work. The algorithm must be very simple and computer efficient. This is the area of data mining approaches. I would suggest you start by looking at covariate patterns to find out where the scarcity lies. These will end up high leverage data. Another place to start is common sense: Thousands of covariates cannot all contain independent information of value. Try to cluster them and pick the best representative from each cluster based on expert knowledge. You may solve your problem quickly that way. At 05:34 AM 10/1/2008, Bernardo Rangel Tura wrote: Em Ter, 2008-09-30 às 18:56 -0500, Frank E Harrell Jr escreveu: Bernardo Rangel Tura wrote: Em Sáb, 2008-09-27 às 10:51 -0700, milicic.marko escreveu: I have a huge data set with thousands of variable and one binary variable. I know that most of the variables are correlated and are not good predictors... but... It is very hard to start modeling with such a huge dataset. What would be your suggestion. How to make a first cut... how to eliminate most of the variables but not to ignore potential interactions... for example, maybe variable A is not good predictor and variable B is not good predictor either, but maybe A and B together are good predictor... Any suggestion is welcomed milicic.marko I think do you start with a rpart(binary variable~.) This show you a set of variables to start a model and the start set to curoff for continous variables I cannot imagine a worse way to formulate a regression model. Reasons include 1. Results of recursive partitioning are not trustworthy unless the sample size exceeds 50,000 or the signal to noise ratio is extremely high. 2. The type I error of tests from the final regression model will be extraordinarily inflated. 3. False interactions will appear in the model. 4. The cutoffs so chosen will not replicate and in effect assume that covariate effects are discontinuous and piecewise flat. The use of cutoffs results in a huge loss of information and power and makes the analysis arbitrary and impossible to interpret (e.g., a high covariate value:low covariate value odds ratio or mean difference is a complex function of all the covariate values in the sample). 5. The model will not validate in new data. Professor Frank, Thank you for your explain. Well, if my first idea is wrong what is your opinion on the following approach? 1- Make PCA with data excluding the binary variable 2- Put de principal components in logistic model 3- After revert principal componentes in variable (only if is interesting for milicic.marko) If this approach is wrong too what is your approach? 
-- Bernardo Rangel Tura, M.D,MPH,Ph.D National Institute of Cardiology Brazil __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
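As a sketch of the "cluster the covariates and pick representatives" suggestion above, assuming Hmisc is available and X is a hypothetical numeric matrix of candidate predictors:

library(Hmisc)
vc <- varclus(as.matrix(X))   # hierarchical clustering on pairwise correlations
plot(vc)                      # inspect the tree, then pick one variable per cluster
                              # using subject-matter knowledge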
Re: [R] Located Latent Class Analysis (Uebersax)
Ubersax includes an .exe file in his zip package. This should solve your problem, if you use Windows. If you use linux, you should be able to easily find a free FORTRAN compiler. At 07:24 AM 9/29/2008, Karen Chan wrote: Dear list members I am new to the list and would be much appreciated if you could help me. I am very interested in applying Latent Class Model for analysing multiple raters agreement data. I have recently found a paper by Uebersax, J. (1993) which described a method in a form of Located Latent Class Analysis (llca). Uebersax has written a Fortran program which is available on the web, for the method described in the paper. I do not have a Fortran compiler on my PC. I know there are many R packages do latent class type models. I am wondering if anyone has implemented llca, or equivalent in R. Thank you in advance for your help -- Karen Chan DHSRU University of Dundee The University of Dundee is a Scottish Registered Charity, No. SC015096. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] FW: logistic regression
that the FDA insists on clinical trials, after all (and reasons why such studies are difficult and expensive to do!). The belief that data mining (as it is known in the polite circles that Frank obviously eschews) is an effective (and even automated!) tool for discovering how Nature works is a misconception, but one that for many reasons is enthusiastically promoted. If you are looking only to predict, it may do; but you are deceived if you hope for Truth. Can you get hints? -- well maybe, maybe not. Chaos beckons. I think many -- maybe even most -- statisticians rue the day that stepwise regression was invented and certainly that it has been marketed as a tool for winnowing out the important few variables from the blizzard of irrelevant background noise. Pogo was right: We have seen the enemy -- and it is us. (As I said, the Inferno awaits...) Cheers to all, Bert Gunter DEFINITELY MY OWN OPINIONS HERE! -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Winsemius Sent: Saturday, September 27, 2008 5:34 PM To: Darin Brooks Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [R] FW: logistic regression It's more a statement that it expresses a statistical perspective very succinctly, somewhat like a Zen koan. Frank's book,Regression Modeling Strategies, has entire chapters on reasoned approaches to your question. His website also has quite a bit of material free for the taking. -- David Winsemius Heritage Laboratories On Sep 27, 2008, at 7:24 PM, Darin Brooks wrote: Glad you were amused. I assume that booking this as a fortune means that this was an idiotic way to model the data? MARS? Boosted Regression Trees? Any of these a better choice to extract significant predictors (from a list of about 44) for a measured dependent variable? -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] ] On Behalf Of Ted Harding Sent: Saturday, September 27, 2008 4:30 PM To: [EMAIL PROTECTED] Subject: Re: [R] FW: logistic regression On 27-Sep-08 21:45:23, Dieter Menne wrote: Frank E Harrell Jr f.harrell at vanderbilt.edu writes: Estimates from this model (and especially standard errors and P-values) will be invalid because they do not take into account the stepwise procedure above that was used to torture the data until they confessed. Frank Please book this as a fortune. Dieter Seconded! Ted. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University No virus found in this incoming message. Checked by AVG - http://www.avg.com 1:11 PM __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R function which finds confidence interval for binomial variance
Thanks Ralph, Moshe and [EMAIL PROTECTED] for your helpful comments. Using the bootstrap (e.g., 'boot' + boot.ci()) for the confidence interval on the variance is not very accurate in coverage, because the sampling distribution is extremely skewed. In fact, the 'BCa' method returns the same result as the Efron 'percent' method. Moshe's idea of treating the confidence interval for the binomial variance as a transform of the confidence interval for the binomial proportion is elegant (Doh! Why didn't I think of that?), except that the transform is bivalued, although monotonic on each branch, with the branch point singularity at p=0.5. The bootstrap method does not have much coverage accuracy for any proportion, for n=6, 12 and 20, and the proportion method works great for n=6, 12, 20 and 50, except near p = 0.5, where it fails to achieve reasonable coverage. So I'm still looking for a reliable method for all p and for reasonable n. The proportion-based method is the best I've found, so far. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Confidence interval for binomial variance
Based on simulations, I've come up with a simple function to compute the confidence interval for the binomial variance, where the true variance is v = rho*(1-rho)/n, with rho = true probability of success and n = # of trials. For x = # successes observed in n trials, p = x / n as usual. For p < 0.25 or p > 0.75, I use the proportion-based transformed confidence interval, as suggested by Moshe Olshansky. I used the Wilson interval, but some might prefer the Blaker interval. For 0.25 <= p <= 0.75, I use the standard chi-square-based confidence interval (which is very conservative for this problem in this range). The composite method gives reliable results over a wide range of rho. This should suit the purpose until someone comes up with a more theoretically sound method. Bootstrap is not reliable for n < 50, at least.

#09.26.08 02.50 binomVarCI.r
#copyright 2008 by Robert A LaBudde, all rights reserved
#CI for binomial sample variance
#created: 09.26.08 by r.a. labudde
#changes:
require('binGroup')
binomVarCI <- function (n, x, conf=0.95) {
  p <- x/n  #proportion
  if (p < 0.25 | p > 0.75 | x == 0 | x == n) { #use proportion-based CI
    pCI <- binWilson(n, x, conf.level=conf)  #CI for proportion
    vCI <- sort(c(pCI[1]*(1-pCI[1])/(n-1), pCI[2]*(1-pCI[2])/(n-1)))
  } else { #use chi-square-based CI
    phiL <- qchisq(0.025, n-1)/(n-1)
    phiU <- qchisq(0.975, n-1)/(n-1)
    vest <- p*(1-p)/(n-1)  #variance estimate
    vCI <- c(vest/phiU, vest/phiL)  #chi-square-based
  }
  return (vCI)
}

Here is a test program to measure coverage:

#09.26.08 03.10 tbinomVarCI.r
#copyright 2008 by Robert A LaBudde, all rights reserved
#test of CI for binomial sample variance
#created: 09.26.08 by r.a. labudde
#changes:
nReal <- 1000
for (POD in c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.50)) {
  cat('\nPOD: ', sprintf('%8.4f', POD), '\n')
  for (nRepl in c(6, 12, 20, 50)) {
    vtrue <- POD*(1-POD)/nRepl
    pcover <- 0
    for (iReal in 1:nReal) {
      x <- rbinom(1, nRepl, POD)
      vCI <- binomVarCI(nRepl, x)
      #vest <- (x/nRepl)*(1 - x/nRepl)/(nRepl-1)
      if (vtrue >= vCI[1] && vtrue <= vCI[2]) pcover <- pcover + 1
    } #iReal
    pcover <- pcover/nReal
    cat('n: ', sprintf('%4i', nRepl), ' Coverage: ', sprintf('%8.4f', pcover), '\n')
  } #nRepl
} #POD

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] R function which finds confidence interval for binomial variance
I need to construct confidence intervals for the binomial variance. This is the usual estimate v = x*(n-x)/n or its unbiased counterpart v' = x*(n-x)/(n-1) where x = binomial number of successes observed in n Bernoulli trials from proportion p. The usual X^2 method for variance confidence intervals will not work, because of the strong non-normal character of the sampling distribution for v (or v'). Does anyone know of an R package with R function that computes a reasonable confidence interval for v or v'? Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Averaging 'blocks' of data
I'm not sure I exactly understand your problem, but if you are looking for a recursive algorithm for calculating the average by addition of one record only at a time, consider: y[k] = y[k-1] + (x[k] - y[k-1])/k, where y(0) = 0, k = 1, 2, ... At each stage, y[k] = (x[1]+...+x[k])/k. At 04:46 PM 9/7/2008, Steve Murray wrote: Gabor - thanks for your suggestion... I had checked the previous post, but I found (as a new user of R) this approach to be too complicated and I had problems gaining the correct output values. If there is a simpler way of doing this, then please feel free to let me know. Dylan - thanks, your approach is a good start. In answer to your questions, my data are 43200 columns and 16800 rows as a data frame - I will probably have to read the dataset in segments though, as it won't fit into the memory! I've been able to follow your example - how would I be able to apply this technique for finding the average of each 60 x 60 block? Any other suggestions are of course welcome! Many thanks again, Steve __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
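A minimal sketch of the running-mean update described above (x is a hypothetical numeric vector; at step k, y equals mean(x[1:k])):

running_mean <- function(x) {
  y <- 0
  for (k in seq_along(x)) {
    y <- y + (x[k] - y) / k   # incremental update: y is now mean(x[1:k])
  }
  y
}
running_mean(c(2, 4, 6))   # 4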
Re: [R] Projecting Survival Curve into the Future
You might consider a probit analysis using ln(Time) as the dose. At 09:24 AM 9/4/2008, [EMAIL PROTECTED] wrote: I have a survivor curve that shows account cancellations during the past 3.5 months. Â Fortunately for our business, but unfortunately for my analysis, the survivor curve doesn't yet pass through 50%. Â Is there a safe way to extend the survivor curve and estimate at what time we'll hit 50%? Without any example code it's hard to say, but take a look at ?predict.coxph and ?predict.survreg in the survival package. Regards, Richie. Mathematical Sciences Unit HSL Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
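One reading of that suggestion, as a hedged sketch only (variable names are hypothetical; 'cancelled' is the cumulative number of accounts cancelled by each time out of 'atrisk' at the start): fit a probit regression on log time and solve for the time at which the fitted curve crosses 50%.

fit <- glm(cbind(cancelled, atrisk - cancelled) ~ log(time),
           family = binomial(link = "probit"), data = dat)
# probit(0.5) = 0, so the 50% point is where intercept + slope*log(t) = 0
t50 <- exp(-coef(fit)[1] / coef(fit)[2])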
Re: [R] Upgrading R means I lose my packages
At 04:12 AM 8/28/2008, ONKELINX, Thierry wrote: On a Windows machine you get the same problem. Unless one uses the same trick as Rolf suggested: don't install the packages in the default directory and set R_LIBS to that directory. Then all you need to do after an upgrade is to set R_LIBS in the new version and run update.packages(checkBuilt = TRUE). Given Rolf's suggestion I suppose this trick will work on a Mac too. What I do in installing a new version of R on a Windows system is as follows:
1. In the c:\Program Files\R folder, the installation is in a subfolder labeled by the version, such as R-2.7.1.
2. I leave the old (say, 2.7.1) version installed, and install the new version (say, 2.7.2). This leaves the old subfolder R-2.7.1 intact, and creates a new one R-2.7.2.
3. I use a file-compare utility (in my case, Beyond Compare, which I recommend), to compare the subfolders C:\Program Files\R\R-2.7.1\library and C:\Program Files\R\R-2.7.2\library. I set the comparison to find files present or newer in the 2.7.1 folder vs. the 2.7.2. Then I copy all such files over.
4. At this point, the 2.7.2 library has the same or newer packages than 2.7.1, most or all of which will work.
5. I use the Packages | Update packages... menu item to update packages to 2.7.2.
6. Then I delete the 2.7.1 subfolder. You need Administrator rights to do this.
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
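A sketch of the R_LIBS approach described above (the library path is only an example):

## In the file .Renviron (or Renviron.site), set a personal library path, e.g.
##   R_LIBS=C:/R/site-library
## Then, in the newly installed R version, rebuild/refresh packages built under
## the old version:
update.packages(checkBuilt = TRUE, ask = FALSE)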
[R] Specifying random effects distribution in glmer()
I'm trying to figure out how to carry out a Poisson regression fit to longitudinal data with a gamma distribution with unknown shape and scale parameters. I've tried the 'lme4' package's glmer() function, which fits the Poisson regression using:

library('lme4')
fit5 <- glmer(seizures ~ time + progabide + timeXprog + offset(lnPeriod) + (1|id),
  data=pdata, nAGQ=1, family=poisson)  #note: can't use nAGQ > 1, not yet implemented
summary(fit5)

Here 'seizures' is a count and 'id' is the subject number. This fit works, but uses the Poisson distribution with the gamma heterogeneity. Based on the example in the help for glmer(), I tried

fit6 <- glmer(seizures ~ time + progabide + timeXprog + offset(lnPeriod) +
  (1|pgamma(id, shap, scal)), data=pdata, nAGQ=1, start=c(shap=1, scal=1),
  family=poisson)  #note: can't use nAGQ > 1, not yet implemented
summary(fit6)

but this ends up with Error in pgamma(id, shap, scal) : object 'shap' not found. My questions are: 1. Can this be done? 2. Am I using the right package and function? 3. What am I doing wrong? Any help would be appreciated. Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Null and Alternate hypothesis for Significance test
for Significance test To: r-help@r-project.org Received: Friday, 22 August, 2008, 6:58 AM Hi, I had a question about specifying the Null hypothesis in a significance test. Advance apologies if this has already been asked previously or is a naive question. I have two samples A and B, and I want to test whether A and B come from the same distribution. The default Null hypothesis would be H0: A=B But since I am trying to prove that A and B indeed come from the same distribution, I think this is not the right choice for the null hypothesis (it should be one that is set up to be rejected) How do I specify a null hypothesis H0: A not equal to B for say a KS test. An example to do this in R would be greatly appreciated. On a related note: what is a good way to measure the difference between observed and expected PDFs? Is the D statistic of the KS test a good choice? Thanks! Nitin [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] shapiro wilk normality test
At 11:30 AM 7/12/2008, Bunny, lautloscrew.com wrote: Hi everybody, somehow i dont get the shapiro wilk test for normality. i just can´t find what the H0 is . i tried : shapiro.test(rnorm(5000)) Shapiro-Wilk normality test data: rnorm(5000) W = 0.9997, p-value = 0.6205 If normality is the H0, the test says it´s probably not normal, doesn ´t it ? 5000 is the biggest n allowed by the test... are there any other test ? ( i know qqnorm already ;) thanks in advance matthias Yes, H0 is normality. The P-value, as for other statistical tests, measures the probability that this sample could have arisen from the population under H0. 0.62 is a probability very compatible with H0. The typical rejection criterion would be a P-value 0.05, which is not the case here. The limitation to n = 5000 is not serious, as even a few hundred data should take you to the asymptotic region. Use sample() to select the data at random from within your data set to avoid bias in using the test. E.g., shapiro.test(sample(mydata, 1000, replace=TRUE)) Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] shapiro wilk normality test
At 12:48 PM 7/12/2008, Bunny, lautloscrew.com wrote: first of all thanks yall. it´s always good to get it from people that know for sure. my bad, i meant to say it´s compatible with normality. i just wanted to know if it wouldnt be better to test for non-normality in order to know for sure. and if so, how can i do it? Doing a significance test may seem complicated, but it's an almost trivial concept. You assume some null hypothesis that specifies a unique distribution that you can use to calculate probabilities from. Then use this distribution to calculate the probability of finding what you found in your data, or more extreme. This is the P-value of the test. It is the probability of finding what you found, given that the null hypothesis is true. You give up (reject) the null hypothesis if this P-value is too unbelievably small. The conventional measure for ordinary, repeatable experiments is 0.05. Sometimes a smaller value like 0.01 is more reasonable. Doing what has been suggested, i.e., using a null hypothesis of nonnormality, is unworkable. There are uncountably infinite ways to specify a nonnormal distribution. Is it discrete or continuous? Is it skewed or symmetric? Does it go from zero to infinity, from 0 to 1, from -infinity to infinity, or anything else? Does it have one mode or many? Is it continuous or differentiable? Etc. In order to do a statistical test, you must be able to calculate the P-value. That usually means your null hypothesis must specify a single, unique probability distribution. So nonnormal in testing means reject normal as the distribution. Nonnormal is not defined other than it's not the normal distribution. If you wish to test how the distribution is nonnormal, within some family of nonnormal distributions, you will have to specify such a null hypothesis and test for deviation from it. E.g., testing for coefficient of skewness = 0. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Is there a good package for multiple imputation of missing values in R?
I'm looking for a package that has a state-of-the-art method of imputation of missing values in a data frame with both continuous and factor columns. I've found transcan() in 'Hmisc', which appears to be possibly suited to my needs, but I haven't been able to figure out how to get a new data frame with the imputed values replaced (I don't have Harrell's book). Any pointers would be appreciated. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Is there a good package for multiple imputation of missing values in R?
At 03:02 AM 6/30/2008, Robert A. LaBudde wrote: I'm looking for a package that has a state-of-the-art method of imputation of missing values in a data frame with both continuous and factor columns. I've found transcan() in 'Hmisc', which appears to be possibly suited to my needs, but I haven't been able to figure out how to get a new data frame with the imputed values replaced (I don't have Harrell's book). Any pointers would be appreciated. Thanks to paulandpen, Frank and Shige for suggestions. I looked at the packages 'Hmisc', 'mice', 'Amelia' and 'norm'. I still haven't mastered the methodology for using aregImpute() in 'Hmisc' based on the help information. I think I'll have to get hold of Frank's book to see how it's used in a complete example. 'Amelia' and 'norm' appear to be focused solely on continuous, multivariate normal variables, but my needs typically involve datasets with both factors and continuous variables. The function mice() in 'mice' appears to best suit my needs, and the help file was intelligible, and it works on both factors and continuous variables. For those in the audience with similar issues, here is a code snippet showing how some of these functions work ('felon' is a data frame with categorical and continuous predictors of the binary variable 'hired'):

library('mice')  #missing data imputation library for md.pattern(), mice(), complete()
names(felon)  #show variable names
md.pattern(felon[,1:4])  #show patterns for missing data in 1st 4 vars
library('Hmisc')  #package for na.pattern() and impute()
na.pattern(felon[,1:4])  #show patterns for missing data in 1st 4 vars
#simple imputation can be done by
felon2 <- felon  #make copy
felon2$felony <- impute(felon2$felony)  #impute NAs (most frequent)
felon2$gender <- impute(felon2$gender)  #impute NAs
felon2$natamer <- impute(felon2$natamer)  #impute NAs
na.pattern(felon2[,1:4])  #show no NAs left in these vars
fit2 <- glm(hired ~ felony + gender + natamer, data=felon2, family=binomial)
summary(fit2)
#better, multiple imputation can be done via mice():
imp <- mice(felon[,1:4])  #do multiple imputation (default is 5 realizations)
for (iSet in 1:5) {  #show results for the 5 imputation datasets
  fit <- glm(hired ~ felony + gender + natamer, data=complete(imp, iSet),
    family=binomial)  #fit to iSet-th realization
  print(summary(fit))
}

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
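As a hedged follow-up to the loop above: the 'mice' package can also fit the model to each imputed dataset and pool the results (Rubin's rules) rather than printing the five fits separately, along the lines of:

fits <- with(imp, glm(hired ~ felony + gender + natamer, family = binomial))
summary(pool(fits))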
Re: [R] How to make R running faster
At 10:25 AM 5/28/2008, Esmail Bonakdarian wrote: Erin Hodgess wrote: I remember reading that colSums() and colMeans() were better, when you need sums and means Well .. I'm waiting for the experts to jump in and give us the straight story on this :-) All of the algorithms are represented internally by sequential program logic using C or Fortran, for example. So the issue isn't the algorithm itself. Instead, it's where the algorithm is implemented. However, R is an interpreter, not a compiler. This means that it reads each line of R code one character at a time to develop an understanding of what is to be done, and to check for errors in syntax and data classes. Interpreters are very slow compared to compiled code, where the lines have been pre-interpreted and already converted to machine code with error checking resolved. For example a simple for loop iteration might take only 0.1 microsecond in a compiled program, but 20-100 microseconds in an interpreted program. This overhead of parsing each line can be amortized over the work done by function calls inside each line. If the compiled function executes on a large number of cases in one call, then the 50 microsecond overhead per call is diluted out. R is a vectorized, array-processing language. If you use vectors and arrays and the built-in (i.e., compiled) function calls, you get maximum use of the compiled programs and minimum use of the interpreted program. This is why functions such as colMeans() or apply() are faster than writing direct loops in R. You can speed things up by 200-1000x for large arrays. Interpreted languages are very convenient to use, as they do instant error checking and are very interactive. No overhead of learning and using compilers and linkers. But they are very slow on complex calculations. This is why the array processing is stuffed into compiled functions. The best of both worlds then. Interpreted languages are Java, R, MATLAB, Gauss and others. Compiled languages are C and Fortran. Some, like variants of BASIC, can be interpreted, line-compiled or compiled, depending upon implementation. Some compiled languages (such as Fortran) can allow parallel processing via multiprocessing on multiple CPUs, which speeds things up even more. Compiled languages also typically optimize code for the target machine, which can speed things up by a factor of 2 or so. So the general rule for R is: If you are annoyed at processing time, alter your program to maximize calculations within compiled functions (i.e., vectorize your program to process an entire array at one time) and minimize the number of lines of R. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
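A small illustration of the point above: the same column means computed with an interpreted double loop and with the compiled colMeans() (timings will of course vary by machine; this is only a sketch).

m <- matrix(rnorm(1e6), nrow = 1000)
system.time({            # interpreted loop, one R statement per element
  cm <- numeric(ncol(m))
  for (j in seq_len(ncol(m))) {
    s <- 0
    for (i in seq_len(nrow(m))) s <- s + m[i, j]
    cm[j] <- s / nrow(m)
  }
})
system.time(cm2 <- colMeans(m))   # one call into compiled code
all.equal(cm, cm2)                # same answer, very different timings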
Re: [R] Solving 100th order equation
At 07:58 AM 5/24/2008, Duncan Murdoch wrote: Shubha Vishwanath Karanth wrote: To apply uniroot I don't even know the interval values... Do numerical methods help me? Or any other method? I forgot: we also have polyroot(). Duncan Murdoch Thanks and Regards, Shubha -Original Message- From: Duncan Murdoch [mailto:[EMAIL PROTECTED] Sent: Saturday, May 24, 2008 5:08 PM To: Shubha Vishwanath Karanth Subject: Re: [R] Solving 100th order equation Shubha Vishwanath Karanth wrote: Hi R, I have a 100th order equation for which I need to solve the value for x. Is there a package to do this? snip Finding all of the roots of a 100-th order polynomial on a computer is not an easy task. It would take more than 100 digits of precision to do so. It is a problem similar to finding all 100 eigenvalues of a 100x100 ill-conditioned matrix. The suggestion to use graphics to find small intervals localizing the roots of interest first would make the most sense, at least for the real roots. It is highly unlikely that all 100 roots are needed. Usually the ones of interest are the ones with the largest real part, or the ones inside the unit circle, or the real roots, etc. Is this a homework problem? Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
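For reference, a minimal sketch of polyroot() on a low-order example (the coefficients are hypothetical and given in increasing order, here 1 + 2x + 3x^2):

polyroot(c(1, 2, 3))
# For a 100th-order polynomial the coefficient vector would have length 101;
# the accuracy of the returned roots degrades badly for ill-conditioned
# high-order polynomials, as noted above.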
Re: [R] Simulation study in R
At 02:40 AM 4/29/2008, Arun Kumar Saha wrote: Here I am in a simulation study where I want to find different values of x and y such that f(x,y)=c (some known constant) w.r.t. x, y 0, y=x and x=c1 (another known constant). Can anyone please tell me how to do it efficiently in R. One way I thought that I will draw different random numbers from uniform dist according to that constraints and pick those which satisfy f(x,y)=c. However it is not I think computationally efficient. Can anyone here suggest me any other efficient approach? You have not specified the distributions proper for X and Y. Using a uniform distribution is only appropriate when it meets requirements. One obvious approach is to sample one of the variables, say X, and then solve your equation for Y. If you're going to draw a lot of samples, it would pay to develop y = g(x) first. But you need to know how to sample X in the first place. Is its distribution uniform, or something else? Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
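A sketch of the "sample one variable, then solve for the other" approach above, using a hypothetical f(x, y) = x^2 + y^2 with c = 1, where y = g(x) can be written down directly; the uniform distribution assumed for x here is only for illustration and must be replaced by whatever distribution the problem actually requires:

f <- function(x, y) x^2 + y^2          # hypothetical constraint function
cc <- 1                                # the known constant c
x <- runif(1000, 0, sqrt(cc))          # sample x (distribution is an assumption)
y <- sqrt(cc - x^2)                    # solve f(x, y) = c for y, i.e. y = g(x)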
Re: [R] quasi-random sequences
You seem to have ambiguous requirements. First, you want equidistribution for a packing structure, which would suggest closest packing or quasirandom sequences, as you have tried. But then you are disturbed by the packing structure, because it gives a perceivable pattern, so you wish to randomize it, but under some unspecified condition of equidistribution (your electrostatic repulsion algorithm). You obviously have some constraints on selection you have not quantified yet. E.g., circles of unspecified radii cannot overlap. You should also realize that any distribution of centers under such constraints will always exhibit structure due to the constraints. Is your problem simply to give an appearance of randomness to the casual observer, or something more definite? You also need to say something about the packing density involved. On the face of it, you with At 06:22 AM 4/26/2008, baptiste Auguié wrote: Dear list useRs, I have to generate a random set of coordinates (x,y) in [-1 ; 1]^2 for say, N points. At each of these points is drawn a circle (later on, an ellipse) of random size, as in:

N <- 100
positions <- matrix(rnorm(2 * N, mean = 0, sd = 0.5), nrow=N)
sizes <- rnorm(N, mean = 0, sd = 1)
plot(positions, type='p', cex=sizes)

My problem is to avoid collisions (overlap, really) between the points. I would like some random pattern, but with a minimum exclusion distance. In looking up Numerical Recipes in C, I found out about some Sobol quasi-random sequences, which one can call from the gsl package,

library(gsl)
g <- qrng_alloc(type='sobol', dim=2)
qrng_get(g, n=N) -> xy
plot(xy, t='p', cex=0.5)

but this does not look very random: I clearly see some pattern (diagonals, etc...), and even the non-overlapping condition is not impressive. One (painful) way I can foresee is to check the distance between each symbol and the others, and move the overlapping ones in a recursive manner. Before delving into this, I wanted to check I'm not overlooking something in the gsl quasi-random sequences, or missing a more obvious way to generate such patterns. Perhaps solving an electrostatic problem with a potential both attractive at long distances and repulsive at short distances is a better way? I have a vague recollection of hearing that somewhere to position points evenly on a sphere. Thanks for any comment / suggestion, Baptiste _ Baptiste Auguié Physics Department University of Exeter Stocker Road, Exeter, Devon, EX4 4QL, UK Phone: +44 1392 264187 http://newton.ex.ac.uk/research/emag http://projects.ex.ac.uk/atto __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] bootstrap for confidence intervals of the mean
See the help for boot(). The function in the 2nd argument has to be of a special form. You need to define such a form, as in:

fmean <- function(x, i) mean(x[i])   # use data x[] and indices i

and then

boot.out <- boot(d, fmean, R = 1000, sim = 'permutation')

At 12:59 PM 4/22/2008, stephen sefick wrote:

d <- c(0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, 0L, 7375L, NA, NA, 17092L, 0L, 0L, 32390L, 2326L, 22672L, 13550L, 18285L)
boot.out <- boot(d, mean, R = 1000, sim = "permutation")
Error in mean.default(data, original, ...) : 'trim' must be numeric of length one

I know that I am missing something but I can't figure it out. thanks stephen -- Let's not spend our time and resources thinking about things that are so little or so large that all they really do for us is puff us up and make us feel like gods. We are mammals, and have not exhausted the annoying little problems of being mammals. -K. Mullis [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
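A runnable sketch of the pattern described above, using the data from the question with the default (ordinary) resampling. Dropping the NAs beforehand is one choice of how to treat the missing values (passing na.rm = TRUE inside the statistic is another); which is appropriate depends on why the values are missing:

library(boot)
d <- c(0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, 0L, 7375L, NA, NA,
       17092L, 0L, 0L, 32390L, 2326L, 22672L, 13550L, 18285L)
d <- d[!is.na(d)]                          # one way to handle the missing values
fmean <- function(x, i) mean(x[i])         # statistic: data x[], indices i
b <- boot(d, fmean, R = 1000)              # ordinary nonparametric bootstrap
boot.ci(b, type = c("norm", "basic", "perc", "bca"))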
[R] Multidimensional contingency tables
How does one ideally handle and display multidimensional contingency tables in R v. 2.6.2? E.g.:

prob1 <- data.frame(victim = c(rep('white',4), rep('black',4)),
                    perp = c(rep('white',2), rep('black',2), rep('white',2), rep('black',2)),
                    death = rep(c('yes','no'), 4),
                    count = c(19, 132, 11, 52, 0, 9, 6, 97))
prob1
  victim  perp death count
1  white white   yes    19
2  white white    no   132
3  white black   yes    11
4  white black    no    52
5  black white   yes     0
6  black white    no     9
7  black black   yes     6
8  black black    no    97

The xtabs() function doesn't seem appropriate, as it has no means of using 'count'. This must be a common problem. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Multidimensional contingency tables
Now that is simple and elegant. Thanks! PS. Is there a course available for learning how to read R help information? :)

At 10:52 PM 4/21/2008, Gabor Grothendieck wrote: xtabs(count ~ ., prob1)

On Mon, Apr 21, 2008 at 10:46 PM, Robert A. LaBudde [EMAIL PROTECTED] wrote: How does one ideally handle and display multidimensional contingency tables in R v. 2.6.2? E.g.:

prob1 <- data.frame(victim = c(rep('white',4), rep('black',4)),
                    perp = c(rep('white',2), rep('black',2), rep('white',2), rep('black',2)),
                    death = rep(c('yes','no'), 4),
                    count = c(19, 132, 11, 52, 0, 9, 6, 97))
prob1
  victim  perp death count
1  white white   yes    19
2  white white    no   132
3  white black   yes    11
4  white black    no    52
5  black white   yes     0
6  black white    no     9
7  black black   yes     6
8  black black    no    97

The xtabs() function doesn't seem appropriate, as it has no means of using 'count'. This must be a common problem. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
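To illustrate, assuming prob1 as defined in the question, the one-liner above builds the full 2 x 2 x 2 table; ftable() is one convenient way to display it, and summary() gives a quick test of mutual independence:

tab <- xtabs(count ~ victim + perp + death, data = prob1)   # same as count ~ .
ftable(tab, row.vars = c("victim", "perp"))                 # flattened display
summary(tab)                                                # chi-squared test of mutual independence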
[R] Problem installing and using package tseries
I have R 2.6.2, and have tried downloading and installing the package tseries. I get the same error when I use two different mirror sites: utils:::menuInstallPkgs() trying URL 'http://cran.mirrors.hoobly.com/bin/windows/contrib/2.6/tseries_0.10-14.zip' Content type 'application/zip' length 400799 bytes (391 Kb) opened URL downloaded 391 Kb package 'tseries' successfully unpacked and MD5 sums checked The downloaded packages are in C:\Documents and Settings\User\Local Settings\Temp\Rtmpk2jrvn\downloaded_packages updating HTML package descriptions library('tseries') Error in dyn.load(file, ...) : unable to load shared library 'C:/PROGRA~1/R/R-26~1.2/library/tseries/libs/tseries.dll': LoadLibrary failure: The specified module could not be found. Error: package/namespace load failed for 'tseries' There doesn't seem to be anything unusual. The package is installed like any other, and the required file is at C:\Program Files\R\R-2.6.2\library\tseries\libs\tseries.dll. It is 36,864 bytes and looks superficially okay. Any ideas as to what's wrong here? Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Problem installing and using package tseries
Thanks, Uwe. Note that another posting today had the same type of problem: require(Matrix) Loading required package: Matrix Error in dyn.load(file, ...) : unable to load shared library 'C:/PROGRA~1/R/R-26~1.2/library/Matrix/libs/Matrix.dll': LoadLibrary failure: The specified module could not be found. In both cases, R is looking to the wrong folder apparently. At 04:54 PM 4/10/2008, Uwe Ligges wrote: Can you please try to reinstall in roughly 12 hours from now from CRAN master? I have fixed the problem and recompiled tseries which will appear on CRAN master within 12 hours. Best wishes, Uwe Ligges Robert A. LaBudde wrote: I have R 2.6.2, and have tried downloading and installing the package tseries. I get the same error when I use two different mirror sites: utils:::menuInstallPkgs() trying URL 'http://cran.mirrors.hoobly.com/bin/windows/contrib/2.6/tseries_0.10-14.zip' Content type 'application/zip' length 400799 bytes (391 Kb) opened URL downloaded 391 Kb package 'tseries' successfully unpacked and MD5 sums checked The downloaded packages are in C:\Documents and Settings\User\Local Settings\Temp\Rtmpk2jrvn\downloaded_packages updating HTML package descriptions library('tseries') Error in dyn.load(file, ...) : unable to load shared library 'C:/PROGRA~1/R/R-26~1.2/library/tseries/libs/tseries.dll': LoadLibrary failure: The specified module could not be found. Error: package/namespace load failed for 'tseries' There doesn't seem to be anything unusual. The package is installed like any other, and the required file is at C:\Program Files\R\R-2.6.2\library\tseries\libs\tseries.dll. It is 36,864 bytes and looks superficially okay. Any ideas as to what's wrong here? Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] generate random numbers subject to constraints
At 05:06 PM 3/26/2008, Ted Harding wrote: On 26-Mar-08 21:26:59, Ala' Jaouni wrote: X1,X2,X3,X4 should have independent distributions. They should be between 0 and 1 and all add up to 1. Is this still possible with Robert's method? Thanks

I don't think so. A while ago you wrote "The numbers should be uniformly distributed" (but in the context of an example where you had 5 variables; now you are back to 4 variables). Let's take the 4-variable case first. The two linear constraints confine the point (X1,X2,X3,X4) to a triangular region within the 4-dimensional unit cube. Say it has vertices A, B, C. You could then start by generating points uniformly distributed over a specific triangle in 2 dimensions, say the one with vertices at A0=(0,0), B0=(0,1), C0=(1,0). This is easy. Then you need to find a linear transformation which will map this triangle (A0,B0,C0) onto the triangle (A,B,C). Then the points you have sampled in (A0,B0,C0) will map into points which are uniformly distributed over the triangle (A,B,C). More generally, you will be seeking to generate points uniformly distributed over a simplex. For example, the case (your earlier post) of 5 variables with 2 linear constraints requires a tetrahedron with vertices (A,B,C,D) in 5 dimensions whose coordinates you will have to find. Then take an easy tetrahedron with vertices (A0,B0,C0,D0) and sample uniformly within this. Then find a linear mapping from (A0,B0,C0,D0) to (A,B,C,D) and apply this to the sampled points. This raises a general question: Does anyone know of an R function to sample uniformly in the interior of a general (k-r)-dimensional simplex embedded in k dimensions, with (k-r+1) given vertices? snip

The method of rejection:
1. Generate numbers randomly in the hypercube.
2. Test to see if the point falls within the prescribed area.
3. Accept the point if it does.
4. Repeat if it doesn't.
Efficiency depends upon the ratio of volumes involved.

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
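A minimal sketch of the rejection method just described, for constraints that carve out a region of positive volume (inequalities; an exact equality such as sum(x) = 1 has zero volume, and is better handled by solving for one coordinate or by the simplex-mapping approach above). The constraint region used here is hypothetical:

inside <- function(x) sum(x) <= 1 && all(x >= 0)   # hypothetical region in the hypercube
sample_rejection <- function(n, k, inside) {
  out <- matrix(NA_real_, nrow = n, ncol = k)
  got <- 0
  while (got < n) {
    x <- runif(k)          # 1. generate a point in the unit hypercube
    if (inside(x)) {       # 2. test the constraints
      got <- got + 1       # 3. accept the point if it satisfies them
      out[got, ] <- x
    }                      # 4. otherwise repeat
  }
  out
}
pts <- sample_rejection(1000, 4, inside)

As noted, the expected number of draws per accepted point is the reciprocal of the volume ratio, so this is only practical when the constrained region is not too small.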
Re: [R] peak finding
At 10:48 PM 3/24/2008, Chistos wrote: John, There is a peak finding algorithm attributed to Prof. Ripley in the R-help archive. It has a span parameter which might give you something close to what you seem to be looking for. http://finzi.psych.upenn.edu/R/Rhelp02a/archive/33097.html -Christos

Finding peaks and valleys has several issues:
1. Impact of noise.
2. Mathematical smoothness of the underlying signal.
3. Boundary conditions at the beginning and end of the data series.
4. Bias from smoothing.
If the noise is not too bad, I'd try fitting a smoothing spline, and then use the zero-derivative points to locate the extrema. If the noise is severe, you will probably need some domain knowledge, and may end up with locally weighted regression, followed by an extrema search. For cases where the noise is trivial, the following short function might be useful in picking off peaks (and, symmetrically modified, valleys):

# findpeaks()
# Copyright 2007 by Robert A LaBudde, all rights reserved
# find peaks in (x,y) curve
findpeaks <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  if (nx != ny) stop('findpeaks01: x and y must be same length!')
  ipeak <- NULL
  xpeak <- NULL
  ypeak <- NULL
  yprv <- y[1]
  for (i in 1:ny) {
    ynext <- ifelse(i == ny, y[ny], y[i+1])
    if (yprv < y[i] && y[i] > ynext) {   # found local peak
      ipeak <- c(ipeak, i)
      xpeak <- c(xpeak, x[i])
      ypeak <- c(ypeak, y[i])
    }
    yprv <- y[i]
  }
  return(data.frame(ipeak, x = xpeak, y = ypeak))
}

Trivial though it may be, it actually works quite well for its intended purpose. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
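A hypothetical usage example combining the smoothing-spline suggestion with findpeaks(); the signal, noise level, and grid are made up:

set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.1)   # noisy signal with peaks near x = pi/2 and 5*pi/2
fit <- smooth.spline(x, y)           # smooth first to suppress spurious local maxima
ys <- predict(fit, x)$y              # smoothed values on the original grid
findpeaks(x, ys)                     # local maxima of the smoothed curve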
Re: [R] stepAIC and polynomial terms
At 08:50 PM 3/16/2008, caspar wrote: Dear all, I have a question regarding the use of stepAIC and polynomial (quadratic, to be specific) terms in a binary logistic regression model. I read in McCullagh and Nelder (1989, p. 89), and as far as I remember from my statistics courses, that higher-degree polynomial effects should not be included without the main effects. If I understand this correctly, following a stepwise model selection based on AIC should not lead to a model where the main effect of some continuous covariate is removed but the quadratic term is kept. The question is, should I keep the quadratic term (note, there are other main effects that were retained following the stepwise algorithm) in the final model, or should I delete it as well and move on? Or should I retain the main effect as well? To picture it, the initial model to which I called stepAIC is:

Call: glm(formula = S ~ FR + Date * age + I(age^2), family = logexposure(ExposureDays = DATA$int), data = DATA)

and the final one:

Call: glm(formula = S ~ FR + Date + I(age^2), family = logexposure(ExposureDays = DATA$int), data = DATA)

Thanks very much in advance for your thoughts and suggestions, Caspar

1. You should only exclude age as a linear term if you have a sound causal reason for believing a pure quadratic component is solely present. Based on your example, you probably don't have this.
2. You also need to worry about interactions.
3. An alternative to your polynomial approach to a causal variable such as age is to categorize age into 5- or 10-year intervals, and see how the fit breaks down by these levels.
4. You should plot your data vs. age to see what the dependence is. Frequently the curve is flat up to a certain age and then linear thereafter. This gives rise to a pseudo-quadratic relationship. You should be able to fit it better with the split plus a linear term.
5. Think about how age should affect your response before trying models.

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
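A rough sketch of points 3 and 4 above, reusing the poster's variable names (S, FR, Date, age, DATA). The break points and the knot a0 are hypothetical, and family = binomial stands in here for the poster's custom logexposure() family, purely for illustration:

# Point 3: categorize age into 10-year intervals and inspect the fit by level
DATA$ageGrp <- cut(DATA$age, breaks = seq(0, 100, by = 10), right = FALSE)
fit.cat <- glm(S ~ FR + Date + ageGrp, family = binomial, data = DATA)

# Point 4: "flat up to a certain age, then linear" via a hinge term at a hypothetical split a0
a0 <- 40
fit.hinge <- glm(S ~ FR + Date + pmax(age - a0, 0), family = binomial, data = DATA)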
Re: [R] How to do multi-factor stratified sampling in R
At 03:54 PM 3/8/2008, David Winsemius wrote: Robert A. LaBudde [EMAIL PROTECTED] wrote in news:[EMAIL PROTECTED]: Given a set of data with a number of variables plus a response, I'd like to obtain a randomized subset of the rows such that the marginal proportions of each variable are maintained closely in the subset to that of the dataset, and possibly maintaining as well the two-factor interaction marginal proportions as well for some pairs. This must be a common problem in data mining, but I don't seem to be able to locate the proper library or function for doing this in R. Thanks for any help. Have you looked at the sampling package? I have never used it, but the strata() function appears to be capable. Thank you for pointing out this package and function. It is going to be very useful. I'll have to look into how I'm searching R, as 'sampling' and 'stratified' should have turned this up. I will spend some time looking at how strata() works to see how well it handles the problems I'm looking at. Thanks again. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
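A minimal sketch (not tested against the exact use case above) of proportional stratified sampling with sampling::strata(). The data frame dat and its columns are hypothetical, and note that strata() expects the data sorted by the stratification variables, with the stratum sizes given in that same order:

library(sampling)
dat <- data.frame(sex    = sample(c("F", "M"), 200, replace = TRUE),
                  region = sample(c("N", "S"), 200, replace = TRUE),
                  y      = rnorm(200))
dat <- dat[order(dat$sex, dat$region), ]            # sort by the strata variables
n.per <- round(0.25 * table(dat$sex, dat$region))   # take ~25% of each stratum
st <- strata(dat, stratanames = c("sex", "region"),
             size = as.vector(t(n.per)), method = "srswor")
samp <- getdata(dat, st)                            # the stratified subsample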
[R] Getting 'tilting' confidence intervals in R
[Resend with proper subject line] I am trying to compute bootstrap confidence intervals for a sample using R 2.5.1 for Windows. I can get Normal, Basic, Percentile, BCa and ABC intervals from boot.ci() and boot() in the Davison and Hinkley boot package. But I can't figure out how to use tilt.boot() to get the tilting confidence interval. Here's a program snippet:

g = rgamma(N, shape=2, scale=3) # generate a sample of N random deviates
varf = function (x, i) { var(x[i]) }
b = boot(g, varf, R=1000)
boot.ci(b)
fabc = function (x, w) { sum(x^2*w)/sum(w) - (sum(x*w)/sum(w))^2 } # weighted average (biased) variance
abc.ci(g, fabc) # ABC method confidence interval
bt = tilt.boot(g, varf, R=c(1000,1000,1000))

The bt object doesn't have a confidence interval method or property, even though alpha=c(.025, .975) is a default parameter in the tilt.boot() call. This must be simple, but I'm not finding the answer in the documentation. Use of any other packages to accomplish the tilting interval would also be useful. Thanks. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] applying math/stat functions to rows in data frame
At 12:02 PM 9/15/2007, Gerald wrote: Hi All, There are a variety of functions that can be applied to a variable (column) in a data frame: mean, min, max, sd, range, IQR, etc. I am aware of only two that work on the rows, using q1-q3 as example variables: rowMeans(cbind(q1,q2,q3),na.rm=T) #mean of multiple variables rowSums (cbind(q1,q2,q3),na.rm=T) #sum of multiple variables Can the standard column functions (listed in the first sentence) be applied to rows, with the use of correct indexes to reference the columns of interest? Or, must these summary functions be programmed separately to work on a row? Try using t() to transpose the matrix, and then apply the column function of interest. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
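For illustration, with hypothetical columns q1-q3, either apply() over rows or the transpose trick mentioned above lets any column-style summary work row-wise:

dat <- data.frame(q1 = rnorm(5), q2 = rnorm(5), q3 = rnorm(5))
m <- as.matrix(dat[, c("q1", "q2", "q3")])
rowMeans(m, na.rm = TRUE)         # built-in row-wise mean
apply(m, 1, sd,  na.rm = TRUE)    # row-wise SD via apply over MARGIN = 1
apply(m, 1, IQR, na.rm = TRUE)    # likewise for IQR, min, max, etc.
colMeans(t(m), na.rm = TRUE)      # the t() trick: row means as column means of the transpose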