Re: [R] significance testing for the difference in the ratio of means
groups of size 70 and 72 to represent the two ratios and compute the difference in means. This distribution could represent the distribution under the null hypothesis, and I could then locate my observed value on it to compute the p-value. This makes me uncomfortable, however, as it seems to treat the data as a mean of ratios rather than a ratio of means.

Method 4: Combination of bootstrap and permutation test. Sample with replacement samples of size 7, 10, 8, 9 from m1, m2, m3, m4 respectively, as in Method 2 above. Calculate the two ratios for these 4 samples (m2/m1 and m4/m3) and record them in a list. Repeat this an arbitrary number (B) of times, recording the two ratios each time; hence if B = 10, we will have 20 observations of the ratios. Then proceed with permutation testing on these 20 ratio observations by repeatedly randomizing them into two equal groups of 10 and computing the difference in means of the two groups, as in Method 3 above. This could potentially yield a distribution under the null hypothesis, and p-values could be obtained by locating the observed value on this distribution. I am unsure of appropriate values for B, or whether this method is valid at all.

Another complication is the concern about multiple comparisons if I wished to include additional test groups (m5 = testgroup2; m6 = testgroup2 w/ treatment; m7 = testgroup3; m8 = testgroup3 w/ treatment; etc.) and how that might appropriately be handled.

Method 2 seems the most intuitive to me. Bootstrapping this way will likely yield appropriate standard errors for the two ratios. However, I am very much interested in appropriate p-values for the comparison, and I am not sure whether locating 0 on the bootstrap distribution of the difference of means is appropriate.

Thank you in advance for your suggestions.

-Rahul
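A minimal R sketch of the resampling in Methods 2 and 4 as described above, assuming m1, m2, m3, m4 are the four numeric vectors mentioned earlier in the post (sizes 7, 10, 8, 9); the percentile interval at the end is one hedged way to judge whether 0 is plausible:

set.seed(1)
B <- 10000
boot.diff <- replicate(B, {
  r1 <- mean(sample(m2, replace = TRUE)) / mean(sample(m1, replace = TRUE))  # ratio of means
  r2 <- mean(sample(m4, replace = TRUE)) / mean(sample(m3, replace = TRUE))
  r1 - r2
})
sd(boot.diff)                          # bootstrap SE of the difference of the two ratios
quantile(boot.diff, c(0.025, 0.975))   # percentile interval; does it cover 0?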
[R] Minimum number of resamples required to do BCa bootstrap?
I'm using R 2.15.1 on a 64-bit machine with Windows 7 Home Premium and package 'boot'. I've found that using a number of bootstrap resamples in boot() that is less than the number of data results in a fatal error. Once the number of resamples meets or exceeds the number of data, the error disappears.

Sample problem (the screwy subscripted syntax is a relic of editing down a more complex script):

> N <- 25
> s <- rlnorm(N, 0, 1)
> require(boot)
Loading required package: boot
> v <- NULL    # hold sample variance estimates
> i <- 1
> v[i] <- var(s)    # get sample variance
> nReal <- 10
> varf <- function (x, i) { var(x[i]) }
> fabc <- function (x, w) {  # weighted average (biased) variance
+   sum(x^2 * w) / sum(w) - (sum(x * w) / sum(w))^2
+ }
> p <- c(.25, .75, .2, .8, .15, .85, .1, .9, .05, .95, .025, .975, .005, .995)
> cl <- c(0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99)
> b <- boot(s, varf, R = nReal)   # bootstrap
> bv <- NULL     # hold bootstrap mean variance estimates
> bias <- NULL   # hold bias estimates
> bv[i] <- mean(b$t)        # bootstrap mean variance
> bias[i] <- bv[i] - v[i]   # bias estimate
> bCI90 <- boot.ci(b, conf = 0.90)
Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o,  :
  estimated adjustment 'a' is NA
In addition: Warning messages:
1: In norm.inter(t, (1 + c(conf, -conf))/2) :
  extreme order statistics used as endpoints
2: In boot.ci(b, conf = 0.9) : bootstrap variances needed for studentized intervals
3: In norm.inter(t, alpha) : extreme order statistics used as endpoints
> nReal <- 25
> b <- boot(s, varf, R = nReal)   # bootstrap
> bv[i] <- mean(b$t)        # bootstrap mean variance
> bias[i] <- bv[i] - v[i]   # bias estimate
> bCI90 <- boot.ci(b, conf = 0.90)
Warning messages:
1: In boot.ci(b, conf = 0.9) : bootstrap variances needed for studentized intervals
2: In norm.inter(t, adj.alpha) : extreme order statistics used as endpoints

The problem is that doing 10 resamples generates an NA in the estimation of the 'acceleration' in the function abc.ci(), but doing 25 resamples does not. This implies a connection between the number of resamples and the 'acceleration' which should not exist. ('Acceleration' should be obtained from the original sample via jackknife as 1/6 the coefficient of skewness.)

The script apparently works correctly if the number of resamples equals or exceeds the number of original data, but not otherwise.
[R] Possible error in BCa method for confidence intervals in package 'boot'
I'm using R 2.15.1 on a 64-bit machine with Windows 7 Home Premium. The sample problem and console session are the same as in my previous message ("Minimum number of resamples required to do BCa bootstrap?"): with nReal = 10 resamples on N = 25 data, boot.ci(b, conf = 0.90) fails with

Error in bca.ci(boot.out, conf, index[1L], L = L, t = t.o, t0 = t0.o,  :
  estimated adjustment 'a' is NA

while with nReal = 25 resamples only warnings are issued.

The problem is that doing 10 resamples generates an NA in the estimation of the 'acceleration' in the function abc.ci(), but doing 25 resamples does not. This implies a connection between the number of resamples and the 'acceleration' which should not exist. ('Acceleration' should be obtained from the original sample via jackknife as 1/6 the coefficient of skewness.)

Looking at the script for abc.ci(), there is an anomalous reference to 'n' in the invocation line, yet 'n' is not an argument, so it must be defined more globally before the call. Yet 'n' is defined within the script as the length of 'data', which is referred to as the 'bootstrap' vector in the comments but should be the original sample data. This confusion, plus the use of an argument 'eps' with a default of 0.001/n in the calculations, makes me suspect the programming in the script.

The script apparently works correctly if the number of resamples equals or exceeds the number of original data, but not otherwise.
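For what it's worth, a sketch of the jackknife-based acceleration described in the two posts above, for the variance statistic and the sample s from the sample problem; computed this way it depends only on the original data, not on the number of bootstrap resamples:

theta.jack <- sapply(seq_along(s), function(i) var(s[-i]))  # leave-one-out variances
d <- mean(theta.jack) - theta.jack
a <- sum(d^3) / (6 * sum(d^2)^1.5)   # standard BCa acceleration estimate
a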
Re: [R] coxph() - unexpected result using Crawley's seedlings data (The R Book)
Did you create the 'status' variable the way indicated on p. 797? With Surv() it frequently pays to use syntax such as Surv(death, status==1) to make a clear logical statement of what is an event (status==1) vs. censored.

PS. Next time include head(seedlings) and str(seedlings) to make clear what you are using as data.

At 06:51 AM 6/28/2011, Jacob Brogren wrote:

Hi,

I ran the example on pp. 799-800 from Michael Crawley's The R Book using package survival v. 2.36-5, R 2.13.0 and RStudio 0.94.83. The model is a Cox proportional hazards model. The result was quite different compared to the R Book. I have compared my code to the code in the book but cannot find any differences in the function call. My results are given below, as well as a link to the results presented in the book (link to Google Books). When running the examples on pp. 797-799 I can't detect any differences in results, so I don't think there are errors in the data set or in the creation of the status variable.

- Original from the R Book:
http://books.google.com/books?id=8D4HVx0apZQClpg=PA799ots=rQgd_8ofeSdq=r%20coxph%20crawleypg=PA799#v=onepageqf=false

- My result:

> summary(model1)
Call:
coxph(formula = Surv(death, status) ~ strata(cohort) * gapsize, data = seedlings)

  n= 60, number of events= 60

                                             coef exp(coef) se(coef)      z Pr(>|z|)
gapsize                                 -0.001893  0.998109 0.593372 -0.003    0.997
gapsize:strata(cohort)cohort=September   0.717407  2.049112 0.860807  0.833    0.405

                                       exp(coef) exp(-coef) lower .95 upper .95
gapsize                                   0.9981      1.002    0.3120     3.193
gapsize:strata(cohort)cohort=September    2.0491      0.488    0.3792    11.074

Rsquare= 0.022   (max possible= 0.993 )
Likelihood ratio test= 1.35  on 2 df,   p=0.5097
Wald test            = 1.32  on 2 df,   p=0.5178
Score (logrank) test = 1.33  on 2 df,   p=0.514

Anyone have an idea why this is occurring?

Kind Regards
Jacob
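A sketch of the Surv() advice above, assuming 'status' was built as on p. 797 with 1 meaning the seedling died (the "60 events out of 60" in the output suggests no censoring):

library(survival)
model1 <- coxph(Surv(death, status == 1) ~ strata(cohort) * gapsize, data = seedlings)
summary(model1)
head(seedlings); str(seedlings)   # as suggested, include these with the post next time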
Re: [R] analysing a three level reponse
The best method would probably be proportional odds regression using polr() in 'MASS'. At least it's a starting point.

At 09:19 AM 6/22/2011, Matt Ellis (Research) wrote:

Hello,

I am struggling to figure out how to analyse a dataset I have inherited (please note this was conducted some time ago, so the data is as it is, and I know it isn't perfect!). A brief description of the experiment follows:

Pots of grass were grown in 1 l pots of standard potting medium for 1 month with a regular light and watering regime. At this point they were randomly given 1 l of one of 4 different pesticides at one of 4 different concentrations (100%, 75%, 50% or 25% in water). There were 20 pots of grass for each pesticide/concentration combination, giving 320 pots. There were no control (untreated) pots. The response was measured after 1 week and recorded as either:

B1 - grass dead
B2 - grass affected but not dead
B3 - no visible effect

I could analyse this as a lethal effect vs non-lethal effect (B1 vs B2+B3) or some effect vs no effect (B1+B2 vs B3) binomial model, but I can't see how to do it with three levels.

Any pointing in the right direction greatly appreciated!

Thanks
Matt
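A rough sketch of the polr() suggestion; the data frame and column names here are hypothetical, and the response is treated as ordered (no effect < affected < dead):

library(MASS)
grass$outcome <- factor(grass$outcome, levels = c("B3", "B2", "B1"), ordered = TRUE)
fit <- polr(outcome ~ pesticide * conc, data = grass, Hess = TRUE)
summary(fit)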
Re: [R] Using MCMC sampling to estimate p values with a mixed model
If your alternative hypothesis is unequal variances (2-sided), both F > 1 and F < 1 are of interest, and rejection of the equal-variance null can occur on either side. The usual ANOVA F test is 1-sided, with the alternative that the numerator variance exceeds the denominator one, so this is perhaps why you are confused.

At 02:40 PM 6/18/2011, Woodcock, Helena wrote:

Hi everyone,

Apologies if this is a silly question, but I am a student and this is my first time using R, so I am still trying to educate myself on commands, models, etc.

I have a mixed model with four dichotomous fixed factors and subject as a random factor (as each person completed four vignettes, with factors crossed across vignettes). I have run an lmer model and used the Monte Carlo method to see if there are any significant main effects or interactions.

However, when I looked at the p values, some are showing as significant although the F value is less than 1. Is it possible to have a significant effect with an F value below 1? I have a sample size of 150 and have read that the pMCMC values can be anti-conservative, so I wonder if it is because my sample size may be too small.

Thank you for any help
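A toy illustration of the point about two-sided tests (not the poster's mixed model): a two-sided variance-ratio test can be significant even though the F statistic is below 1.

set.seed(42)
x <- rnorm(50, sd = 1)
y <- rnorm(50, sd = 2)
var.test(x, y)   # F = var(x)/var(y) is well below 1, yet the two-sided p-value is small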
[R] How to convert a factor column into a numeric one?
I have a data frame:

> head(df)
  Time Temp Conc Repl    Log10
1    0  -20    H    1 6.406547
2    2  -20    H    1 5.738683
3    7  -20    H    1 5.796394
4   14  -20    H    1 4.413691
5    0    4    H    1 6.406547
7    7    4    H    1 5.705433
> str(df)
'data.frame':   177 obs. of  5 variables:
 $ Time : Factor w/ 4 levels "0","2","7","14": 1 2 3 4 1 3 4 1 3 4 ...
 $ Temp : Factor w/ 4 levels "-20","4","25",..: 1 1 1 1 2 2 2 3 3 3 ...
 $ Conc : Factor w/ 3 levels "H","L","M": 1 1 1 1 1 1 1 1 1 1 ...
 $ Repl : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Log10: num  6.41 5.74 5.8 4.41 6.41 ...
> levels(df$Temp)
[1] "-20" "4"   "25"  "45"
> levels(df$Time)
[1] "0"  "2"  "7"  "14"

As you can see, Time and Temp are currently factors, not numeric. I would like to change these columns into numeric ones.

df$Time <- as.numeric(df$Time)

doesn't work, as it changes the values to the factor level indices (1,2,3,4) instead of the values (0,2,7,14). There must be a direct way of doing this in R.

I tried recode() in 'car':

> df$Temp <- recode(df$Temp, '1=-20;2=25;3=4;4=45', as.factor.result=FALSE)
> head(df)
  Time Temp Conc Repl     Freq
1    0  -20    H    1 6.406547
2    2  -20    H    1 5.738683
3    7  -20    H    1 5.796394
4   14  -20    H    1 4.413691
5    0   45    H    1 6.406547
7    7   45    H    1 5.705433

but note that the values for 'Temp' in rows 5 and 7 are 45 and not 4, as expected, although the result is numeric. The same happens if I use the order given by levels(df$Temp) instead of the sort order in the recode() 2nd argument.

Any hints?
Re: [R] How to convert a factor column into a numeric one?
Exactly! Thanks.

At 12:49 AM 6/5/2011, Jorge Ivan Velez wrote:

Dr. LaBudde,

Perhaps as.numeric(as.character(x)) is what you are looking for.

HTH,
Jorge

On Sun, Jun 5, 2011 at 12:31 AM, Robert A. LaBudde wrote:
[original post quoted above - snipped]
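Applied to the data frame from the original post, the suggestion amounts to:

df$Time <- as.numeric(as.character(df$Time))
df$Temp <- as.numeric(as.character(df$Temp))
str(df)   # Time and Temp are now numeric, with values 0/2/7/14 and -20/4/25/45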
Re: [R] How to convert a factor column into a numeric one?
Thanks for your help.

As far as your question below is concerned, the data frame arose as a result of some data cleaning on an original data frame, which was changed into a table, modified, and changed back to a data frame:

ttcrmean <- as.table(by(ngbe[,'Log10'],
    list(Time=ngbe$Time, Temp=ngbe$Temp, Conc=ngbe$Conc, Repl=ngbe$Replicate), mean))
for (k in 1:3) {      # fix up time zeroes
  for (l in 1:5) {    # replicates
    t0val <- ttcrmean[1,3,k,l]
    for (j in 1:4) {  # temps
      ttcrmean[1,j,k,l] <- t0val
    } #j
  } #l
} #k
df <- na.omit(as.data.frame(ttcrmean))
colnames(df)[5] <- 'Log10'

At 12:51 AM 6/5/2011, Joshua Wiley wrote:

Hi Robert,

[snip]

I would also look into *why* those numeric columns are being stored as factors in the first place. If you are reading the data in with read.table() or one of its wrapper functions (like read.csv), then it would be better to preempt the storage as a factor altogether rather than converting back to numeric. For example, perhaps something is being used to indicate missing data that R does not recognize (e.g., SAS uses "."). Specifying na.strings = "." would fix this. See ?read.table for some of the options available.

[snip]
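A sketch of the read-time fix suggested above; the file name and separator here are hypothetical:

ngbe <- read.table("ngbe.txt", header = TRUE, sep = "\t", na.strings = ".")
str(ngbe)   # numeric columns should now be read as numeric rather than factor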
Re: [R] How to convert a factor column into a numeric one?
Thanks! Exactly what I wanted, and the same as Jorge also suggested.

At 12:49 AM 6/5/2011, Dennis Murphy wrote:

Hi:

Try this:

> dd <- data.frame(a = factor(rep(1:5, each = 4)),
+                  b = factor(rep(rep(1:2, each = 2), 5)),
+                  y = rnorm(20))
> str(dd)
'data.frame':   20 obs. of  3 variables:
 $ a: Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 2 2 2 2 3 3 ...
 $ b: Factor w/ 2 levels "1","2": 1 1 2 2 1 1 2 2 1 1 ...
 $ y: num  0.6396 1.467 1.8403 -0.0915 0.2711 ...
> de <- within(dd, {
+   a <- as.numeric(as.character(a))
+   b <- as.numeric(as.character(b))
+ })
> str(de)
'data.frame':   20 obs. of  3 variables:
 $ a: num  1 1 1 1 2 2 2 2 3 3 ...
 $ b: num  1 1 2 2 1 1 2 2 1 1 ...
 $ y: num  0.6396 1.467 1.8403 -0.0915 0.2711 ...

HTH,
Dennis

On Sat, Jun 4, 2011 at 9:31 PM, Robert A. LaBudde r...@lcfltd.com wrote:
[original post quoted above - snipped]
Re: [R] Finding solution set of system of linear equations.
solve() only works for nonsingular systems of equations. Use a generalized inverse for singular systems:

> A <- matrix(c(1,2,1,1, 3,0,0,4, 1,-4,-2,-2, 0,0,0,0), ncol=4, byrow=TRUE)
> A
     [,1] [,2] [,3] [,4]
[1,]    1    2    1    1
[2,]    3    0    0    4
[3,]    1   -4   -2   -2
[4,]    0    0    0    0
> b <- c(0,2,2,0)   # rhs
> b
[1] 0 2 2 0
> require('MASS')
> giA <- ginv(A)   # M-P generalized inverse
> giA
       [,1]          [,2]     [,3] [,4]
[1,]  0.667  1.431553e-16  0.333   0.0
[2,]  0.333 -1.00e-01     -0.0333  0.0
[3,]  0.167 -5.00e-02     -0.0167  0.0
[4,] -0.500  2.50e-01     -0.2500  0.0
> require('Matrix')
> I <- as.matrix(Diagonal(4))   # order-4 identity matrix
> I
     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    0    0
[3,]    0    0    1    0
[4,]    0    0    0    1
> giA %*% b   # particular solution
              [,1]
[1,]  6.67e-01
[2,] -2.67e-01
[3,] -1.33e-01
[4,] -2.220446e-16
> giA %*% A - I   # matrix for parametric homogeneous solution
              [,1] [,2] [,3]          [,4]
[1,]  0.000000e+00  0.0  0.0  5.551115e-16
[2,]  3.469447e-17 -0.2  0.4  4.024558e-16
[3,]  4.510281e-17  0.4 -0.8  2.706169e-16
[4,] -3.330669e-16  0.0  0.0 -7.771561e-16

At 09:34 PM 5/21/2011, dslowik wrote:

I have a simple system of linear equations to solve for X, aX = b:

> a
     [,1] [,2] [,3] [,4]
[1,]    1    2    1    1
[2,]    3    0    0    4
[3,]    1   -4   -2   -2
[4,]    0    0    0    0
> b
     [,1]
[1,]    0
[2,]    2
[3,]    2
[4,]    0

(This is ex Ch1, 2.2 of Artin, Algebra.) So, 3 eqs in 4 unknowns. One can easily use row reductions to find a homogeneous solution (b=0) of:

X_1 = 0, X_2 = -c/2, X_3 = c, X_4 = 0

and solutions of the above system are:

X_1 = 2/3, X_2 = -1/3 - c/2, X_3 = c, X_4 = 0.

So the kernel is 1-D, spanned by X_2 = -X_3/2 (nullity = 1), and the rank is 3.

In R I use solve():

> solve(a, b)
Error in solve.default(a, b) : Lapack routine dgesv: system is exactly singular

and it gives the error that the system is exactly singular, since it seems to be trying to invert a.

So my question is: Can R only solve non-singular linear systems? If not, what routine should I be using? If so, why? It seems that it would be simple and useful enough to have a routine which, given a system as above, returns the null-space (kernel) and the particular solution.
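For completeness, a sketch of another way to get the kernel and to check the particular solution, using the A, b and giA objects defined above:

sv <- svd(A)
nullvec <- sv$v[, sv$d < 1e-10, drop = FALSE]   # basis for {x : A x = 0}
nullvec / nullvec[3]                            # rescaled: (0, -1/2, 1, 0)
xp <- giA %*% b                                 # particular solution from ginv()
A %*% xp - b                                    # ~ 0, so xp solves A x = b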
Re: [R] problem with glm(family=binomial) when some levels have only 0 proportion values
The algorithm is not converging; your iterations are at the maximum.

It won't do any good to add a fractional number to all data, as the result will depend on the number added (try 1.0, 0.5 and 0.1 to see this).

The root problem is that your data are degenerate. Firstly, your types '2' and '3' are indistinguishable in your data. Secondly, consider the case without 'type': if you have all-zero data for 10 trials, you cannot discriminate among mu = 0, 0.1, 0.0001, 0.001 or 0.01. This leads to numerical instability. Thirdly, the variance estimate in the IRLS will start at 0.0, which gives a singularity.

Fundamentally, the algorithm is failing because you are at the boundary of possibilities for a parameter, so special techniques are needed to do maximum likelihood estimation.

The simple solution is to deal with the data for your types separately. Another is to do more batches for '2' and '3' to get an observed failure.

At 05:01 AM 3/2/2011, Jürg Schulze wrote:

Hello everybody

I want to compare the proportions of germinated seeds (seed batches of size 10) of three plant types (1,2,3) with a glm with binomial data (following the method in Crawley: Statistics, an introduction using R, p. 247). The problem seems to be that in two plant types (2,3) all plants have proportions = 0. I give you my data and the model I'm running:

      success failure type
 [1,]       0      10    3
 [2,]       0      10    2
 [3,]       0      10    2
 [4,]       0      10    2
 [5,]       0      10    2
 [6,]       0      10    2
 [7,]       0      10    2
 [8,]       4       6    1
 [9,]       4       6    1
[10,]       3       7    1
[11,]       5       5    1
[12,]       7       3    1
[13,]       4       6    1
[14,]       0      10    3
[15,]       0      10    3
[16,]       0      10    3
[17,]       0      10    3
[18,]       0      10    3
[19,]       0      10    3
[20,]       0      10    2
[21,]       0      10    2
[22,]       0      10    2
[23,]       9       1    1
[24,]       6       4    1
[25,]       4       6    1
[26,]       0      10    3
[27,]       0      10    3

y <- cbind(success, failure)

Call:
glm(formula = y ~ type, family = binomial)

Deviance Residuals:
       Min         1Q     Median         3Q        Max
-1.3521849     -0.427     -0.427     -0.427  2.6477556

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.04445    0.21087   0.211    0.833
typeFxC      -23.16283 6696.13233  -0.003    0.997
typeFxD      -23.16283 6696.13233  -0.003    0.997

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 134.395  on 26  degrees of freedom
Residual deviance:  12.622  on 24  degrees of freedom
AIC: 42.437

Number of Fisher Scoring iterations: 20

Huge standard errors are calculated and there is no difference between plant types 1 and 2 or between plant types 1 and 3.

If I add 1 to all successes, so that all the 0 values disappear, the standard errors become lower and I find highly significant differences between the plant types.

suc <- success + 1
fail <- 11 - suc
Y <- cbind(suc, fail)

Call:
glm(formula = Y ~ type, family = binomial)

Deviance Residuals:
        Min          1Q      Median          3Q         Max
-1.279e+00  -4.712e-08  -4.712e-08   0.000e+00   2.584e+00

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.2231     0.2023   1.103     0.27
typeFxC      -2.5257     0.4039  -6.253 4.02e-10 ***
typeFxD      -2.5257     0.4039  -6.253 4.02e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 86.391  on 26  degrees of freedom
Residual deviance: 11.793  on 24  degrees of freedom
AIC: 76.77

Number of Fisher Scoring iterations: 4

So I think the 0 values of all plants of groups 2 and 3 are the problem, do you agree? I don't know why this is a problem, or how I can explain to a reviewer why a data transformation (+1) is necessary with such a dataset.

I would greatly appreciate any comments.

Juerg
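A sketch of the "deal with the types separately" suggestion, assuming success, failure and type are vectors laid out as printed above. Type 1 has events, so an exact binomial CI is direct; types 2 and 3 have zero successes in all their batches, but an exact 95% upper bound on the germination probability is still available:

binom.test(sum(success[type == 1]),
           sum(success[type == 1] + failure[type == 1]))
binom.test(0, sum(success[type == 2] + failure[type == 2]))$conf.int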
Re: [R] [R-sig-ME] Fwd: Re: ANOVA and Pseudoreplication in R
, either Dettol or Garlic, and press them and incubate them. Then, when the zones have appeared after a day or 2, I take 4 diameter measurements from each zone, across the zone at different angles, to account for the fact that a zone may have a weird shape, or not be quite circular.

I'm concerned about pseudoreplication, such as the multiple readings from one disk, and the 5 lineages - which might be different from one another in each of the two EV. treatments, but not with NE. treatments. I read that I can remove pseudoreplication from the multiple readings on each disk by using the 4 readings on each disk to produce a mean for the disk, and analysing those means - exercising caution where there are extreme values. I think the 3 disks for each lineage are not themselves pseudoreplication, because they are genuinely 3 disks on a plate: the disk diffusion test replicated 3 times - but the multiple readings from one disk, I feel, are pseudoreplication. I've also read about including Error() terms in a formula. I'm unsure whether the two NE. treatments coming from the same culture introduce pseudoreplication at the Treatment factor level, given that the two different antimicrobials used have two different effects.

I was hoping for a more expert opinion on whether I have identified the pseudoreplication correctly, or whether there is indeed pseudoreplication in the 5 lineages or anywhere else I haven't seen, and how best this is dealt with in R. At the minute my solution to the multiple readings from one disk is simply to make a new factor with the disk means and do the ANOVA on that, or even take the means before I load the dataset into R. I'm wondering if an Error() term would be correct.

Thanks,
Ben W.
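A heavily hedged sketch of the disk-mean idea combined with an Error() term; every object and column name here is hypothetical (zones: one row per reading, with Treatment, Lineage, Disk and Diameter columns), and the error structure would need to match how lineages relate to treatments in the actual design:

diskmeans <- aggregate(Diameter ~ Treatment + Lineage + Disk, data = zones, FUN = mean)
fit <- aov(Diameter ~ Treatment + Error(Lineage), data = diskmeans)
summary(fit)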
Re: [R] comparing proportions
prop.test() is applicable to a binomial experiment in each of two classes. Your experiment is binomial only at the subject level; you then have multiple subjects in each of your groups. You have a random factor, Subject, that must be accounted for.

The best way to analyze is a generalized linear mixed model with a binomial distribution family and a logit or probit link. You will probably have to investigate overdispersion.

If you have a small number of subjects, and don't care about the among-subject effect, you can model them as fixed effects and use glm() instead.

Your original question, I believe, related to doing an ANOVA assuming normality. In order for this to work with this kind of proportion problem, you generally won't get good results unless the number of replicates per subject is 12 or more, and the proportions involved are within 0.15 to 0.85. Otherwise you will have biased confidence intervals and significance tests.

At 07:51 PM 2/9/2011, array chip wrote:

Hi Bert,

Thanks for your reply. If I understand correctly, prop.test() is not suitable for my situation. The input to prop.test() is 2 numbers for each group (# of successes and # of trials); for example, group 1 has 5 successes out of 10 trials, group 2 has 3 successes out of 7 trials, etc. prop.test() tests whether the probability of success is the same across groups.

In my case, each group has several subjects and each subject has 2 numbers (# successes and # trials). So

for group 1:
  subject 1: 5 successes, 10 trials
  subject 2: 3 successes, 8 trials
  :
for group 2:
  subject a: 7 successes, 9 trials
  subject b: 6 successes, 7 trials
  :

I want to test whether the probability of success in group 1 is the same as in group 2. It's like comparing 2 groups of samples using a t test; what I am uncertain about is whether a regular t test (or non-parametric test) is still appropriate here when the response variable is actually a proportion. I guess prop.test() cannot be used with my dataset, or I may be wrong?

Thanks

John

From: Bert Gunter gunter.ber...@gene.com
Sent: Wed, February 9, 2011 3:58:05 PM
Subject: Re: [R] comparing proportions

1. Is this a homework problem?
2. ?prop.test
3. If you haven't done so already, get and consult a basic statistical methods book to help you with questions such as this.

-- Bert

Hi, I have a dataset that has 2 groups of samples. For each sample, the response measured is the number of successes (no.success) obtained with the number of trials (no.trials). So a proportion of success (prop.success) can be computed as no.success/no.trials. Now the objective is to test if there is a statistically significant difference in the proportion of success between the 2 groups of samples (say n1=20, n2=30). I can think of 3 ways to do the test:

1. a regular t test based on the variable prop.success
2. a Mann-Whitney test based on the variable prop.success
3. a binomial regression:
   fit <- glm(cbind(no.success, no.trials-no.success) ~ group, data=data, family=binomial)
   anova(fit, test='Chisq')

My questions are:
1. Is a t test appropriate for comparing 2 groups of proportions?
2. How about a Mann-Whitney non-parametric test?
3. Among the 3, which technique is more appropriate?
4. Any other technique you can suggest?
Thank you,
John
Re: [R] comparing proportions
1. If you use a random effects model, you should make Subject the random factor, i.e., a random intercepts model with (1|Subject). Group is a fixed effect: you have only 2 groups. Even if you had more than 2 groups, treating Group as random would return a standard deviation, not a P-value as you wanted. Finally, I doubt you believe the groups used are meaningless, and only the population of groups is of interest; instead you consider them special, so Group is a fixed effect.

2. The number of observations for each Subject is the number of trials, which you previously indicated was 7 to 10 in the cases listed.

3. If you have no interest in the Subject effect, you can use a fixed Subject factor instead with glm() rather than glmer() or another mixed-model function. This is a good idea so long as the number of subjects is, say, less than 10. Otherwise a mixed model is a better idea.

I suggest you fit all three models to learn about what you're doing: 1) glmer() or equivalent, with cbind(successes, failures) ~ Group + (1|Subject); 2) glm() with cbind(successes, failures) ~ Subject + Group; and 3) lm(p ~ Subject + Group), where p is the proportion of successes for a particular subject and group. Then compare the results.

They will probably all 3 give the same conclusion on the hypothesis question about Group. I would guess the glmer() P-value will be largest, then the glm(), and finally the lm(), but the last two may reverse. The lm() model may actually perform fairly well, as the Edgeworth series converges rapidly to normal for binomial distributions with p within 0.15 to 0.85 and 10+ replicates, as I stated before.

I'd be interested in seeing the results of these 3 fits myself, just out of curiosity.

At 01:21 PM 2/10/2011, array chip wrote:

Robert and Bert, thank you both very much for the response, really appreciated. I agree that using regular ANOVA (or a regular t test) may not be wise given the normality issue. So I am debating between a generalized linear model using glm(..., family=binomial) and a generalized linear mixed effects model using glmer(..., family=binomial). I will forward to Robert an off-list email I sent to Bert about whether to use (1|subject) versus (1|group) in the mixed model specification. If using (1|group), both models will give me the same test for the fixed effects, which is what I am mainly interested in. So do I really need a mixed model here?

Thanks again

John

From: Bert Gunter gunter.ber...@gene.com
To: Robert A LaBudde r...@lcfltd.com
Cc: array chip arrayprof...@yahoo.com
Sent: Thu, February 10, 2011 10:04:06 AM
Subject: Re: [R] comparing proportions

Robert:

Yes, exactly. In an off-list email exchange, he clarified this for me, and I suggested exactly what you did, also with the caution that his initial ad hoc suggestions were unwise. His subsequent posts to R-help and the sig-mixed-models lists were the result, although he appears to have specified the model incorrectly in his glmer function (as (1|Group) instead of (1|subject)).

Cheers,
Bert

On Thu, Feb 10, 2011 at 9:55 AM, Robert A LaBudde r...@lcfltd.com wrote:
[LaBudde's earlier reply and the original question, quoted above, snipped]
Re: [R] comparing proportions
You need to change models 2 and 3 to use ~ group + subject. You left subject out as a fixed factor.

At 05:17 PM 2/10/2011, array chip wrote:

Robert, thank you! I tried all 3 models you suggested. Since each subject only has one line of data in my dataset, would including Subject as a factor in glm() or lm() lead to 0 df for residuals? Attached is my dataset, and here is my version of the 3 models:

test <- read.table("test.txt", sep='\t', header=T)
library(lme4)

## model 1: generalized linear mixed model
fit1 <- glmer(cbind(success,failure) ~ group + (1|subject), data=test, family=binomial)

## model 2: generalized linear model
fit2 <- glm(cbind(success,failure) ~ group, data=test, family=binomial)

## model 3: linear model
fit3 <- lm(prop.success ~ group, weights=success+failure, data=test)

> fit1
Generalized linear mixed model fit by the Laplace approximation
Formula: cbind(success, failure) ~ group + (1 | subject)
   Data: test
   AIC   BIC logLik deviance
 54.75 57.89 -24.38    48.75
Random effects:
 Groups  Name        Variance Std.Dev.
 subject (Intercept) 1.8256   1.3511
Number of obs: 21, groups: subject, 21

Fixed effects:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.6785     0.4950  -1.371    0.170
groupgroup2  -0.7030     0.6974  -1.008    0.313

> summary(fit2)

Call:
glm(formula = cbind(success, failure) ~ group, family = binomial, data = test)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8204  -2.0789  -0.5407   1.0403   4.0539

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.4400     0.2080  -2.115   0.0344 *
groupgroup2  -0.6587     0.3108  -2.119   0.0341 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 78.723  on 20  degrees of freedom
Residual deviance: 74.152  on 19  degrees of freedom

> summary(fit3)

Call:
lm(formula = prop.success ~ group, data = test, weights = success + failure)

Residuals:
    Min      1Q  Median      3Q     Max
-1.1080 -0.7071 -0.2261  0.4754  1.9157

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.39175    0.08809   4.447 0.000276 ***
groupgroup2 -0.14175    0.12364  -1.146 0.265838
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8676 on 19 degrees of freedom
Multiple R-squared: 0.0647,     Adjusted R-squared: 0.01548
F-statistic: 1.314 on 1 and 19 DF,  p-value: 0.2658

As you can see, the 3 models gave quite different results: model 2 (generalized linear model) gave a significant p value, while model 1 (generalized mixed model) and model 3 (regular linear model) did not. Also, as you can see from the data, prop.success has some values outside the range 0.15 to 0.85, so maybe the regular linear model is not appropriate?

Thank you,

John

From: Robert A LaBudde r...@lcfltd.com
To: array chip arrayprof...@yahoo.com
Cc: Bert Gunter gunter.ber...@gene.com; r-h...@stat.math.ethz.ch
Sent: Thu, February 10, 2011 12:54:44 PM
Subject: Re: [R] comparing proportions
[earlier reply quoted above - snipped]
Re: [R] goodness-of-fit test
Skewed as they are, your data certainly don't look normal. Try lognormal.

The chi-square test gives good results when all counts are 5 or more, hence the warning.

At 12:25 AM 11/12/2010, Andrew Halford wrote:

Hi All,

I have a dataset consisting of abundance counts of a fish and I want to test if my data are Poisson in distribution or normal.

My first question is whether it is more appropriate to model my data according to a Poisson distribution (if my test says it conforms) or to use transformed data to normalise the data distribution.

I have been using the vcd package:

gf <- goodfit(Y, type = "poisson", method = "MinChisq")

but I get the following error message:

Warning message:
In optimize(chi2, range(count)) : NA/Inf replaced by maximum positive value

I then binned my count data to see if that might help

   V1 V2
1   5 34
2  10 30
3  15 10
4  20  8
5  25  7
6  30  0
7  35  3
8  40  2
9  45  3
10 50  1
11 55  0
12 60  1

but still received an error message:

Goodness-of-fit test for poisson distribution
        X^2 df P(> X^2)
Pearson 2573372 330
Warning message:
In summary.goodfit(gf) : Chi-squared approximation may be incorrect

Am I getting caught out because of zero counts or frequencies in my data?

Andy
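A sketch of the lognormal suggestion above, assuming 'counts' holds the raw, strictly positive abundance counts (zeros would have to be handled separately, since log(0) is undefined):

library(MASS)
fit <- fitdistr(counts, "lognormal")
fit$estimate
ks.test(counts, "plnorm", meanlog = fit$estimate["meanlog"],
        sdlog = fit$estimate["sdlog"])   # approximate: parameters estimated, counts have ties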
Re: [R] How to do bootstrap for the complex sample design?
At 01:38 AM 11/4/2010, Fei xu wrote:

Hello;

Our survey is structured as follows: the area to be investigated is divided into 6 regions; within each region, one urban community and one rural community are randomly selected; then samples are randomly drawn from each selected urban and rural community. The problem is that within each urban/rural stratum we have only one sample. In this case, how do we do the bootstrap?

Any comments or hints are greatly appreciated!

Faye

Just make a table of your data, with each row corresponding to a measurement. Your columns will be Region, UrbanCommunity, RuralCommunity and your response variables. Bootstrap resampling is just generating random row indices into this table, with replacement, i.e.

index <- sample(1:N, N, replace=TRUE)

Then your resample is myTable[index,].

Because you chose UrbanCommunity and RuralCommunity randomly, this shouldn't be a problem. The fact that you chose a subsample size of 1 means you won't be able to estimate within-region variances unless you make some serious assumptions (e.g., UrbanCommunity effect independent of Region effect).
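A sketch of the row-resampling idea just described; 'myTable' and 'N' are as above, and 'response' stands for whichever outcome column is of interest:

N <- nrow(myTable)
B <- 2000
boot.stats <- replicate(B, {
  index <- sample(1:N, N, replace = TRUE)   # resample rows with replacement
  mean(myTable[index, "response"])
})
quantile(boot.stats, c(0.025, 0.975))        # simple percentile interval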
Re: [R] When to use bootstrap confidence intervals?
1. The bootstrap will do poorly for small sample sizes (less than 25 or so). Parametric methods (t) have the advantage of working down to a sample size of less than 5.

2. You need the number of resamples to be reasonably large, say 10,000, for poorly behaved distributions. Otherwise the extreme percentiles of the resampling distribution used in the calculations are too inaccurate.

3. Your examples are pathological. For the normal case, the t C.I. should be optimal, so no surprise there. For the Student t(df=3) case, you have a near-singular case with only a couple of moments that exist. Try something skewed (lognormal) or bimodal (a mixture) as a better example.

4. BCa generally gives the best results, but only when the sample size is moderately large (> 25) and the number of resamples is large (several thousand).

At 07:09 AM 8/16/2010, Mark Seeto wrote:

Hello,

I have a question regarding bootstrap confidence intervals. Suppose we have a data set consisting of single measurements, and that the measurements are independent but the distribution is unknown. If we want a confidence interval for the population mean, when should a bootstrap confidence interval be preferred over the elementary t interval?

I was hoping the answer would be "always", but some simple simulations suggest that this is incorrect. I simulated some data and calculated 95% elementary t intervals and 95% bootstrap BCa intervals (with the boot package). I calculated the proportion of confidence intervals lying entirely above the true mean, the proportion entirely below the true mean, and the proportion containing the true mean. I used a normal distribution and a t distribution with 3 df.

library(boot)
samplemean <- function(x, ind) mean(x[ind])

ci.norm <- function(sample.size, n.samples, mu = 0, sigma = 1, boot.reps) {
  t.under <- 0; t.over <- 0
  bca.under <- 0; bca.over <- 0
  for (k in 1:n.samples) {
    x <- rnorm(sample.size, mu, sigma)
    b <- boot(x, samplemean, R = boot.reps)
    bci <- boot.ci(b, type = "bca")
    if (mu < mean(x) - qt(0.975, sample.size - 1)*sd(x)/sqrt(sample.size)) t.under <- t.under + 1
    if (mu > mean(x) + qt(0.975, sample.size - 1)*sd(x)/sqrt(sample.size)) t.over <- t.over + 1
    if (mu < bci$bca[4]) bca.under <- bca.under + 1
    if (mu > bci$bca[5]) bca.over <- bca.over + 1
  }
  return(list(t = c(t.under, t.over, n.samples - (t.under + t.over))/n.samples,
              bca = c(bca.under, bca.over, n.samples - (bca.under + bca.over))/n.samples))
}

ci.t <- function(sample.size, n.samples, df, boot.reps) {
  t.under <- 0; t.over <- 0
  bca.under <- 0; bca.over <- 0
  for (k in 1:n.samples) {
    x <- rt(sample.size, df)
    b <- boot(x, samplemean, R = boot.reps)
    bci <- boot.ci(b, type = "bca")
    if (0 < mean(x) - qt(0.975, sample.size - 1)*sd(x)/sqrt(sample.size)) t.under <- t.under + 1
    if (0 > mean(x) + qt(0.975, sample.size - 1)*sd(x)/sqrt(sample.size)) t.over <- t.over + 1
    if (0 < bci$bca[4]) bca.under <- bca.under + 1
    if (0 > bci$bca[5]) bca.over <- bca.over + 1
  }
  return(list(t = c(t.under, t.over, n.samples - (t.under + t.over))/n.samples,
              bca = c(bca.under, bca.over, n.samples - (bca.under + bca.over))/n.samples))
}

> set.seed(1)
> ci.norm(sample.size = 10, n.samples = 1000, boot.reps = 1000)
$t
[1] 0.019 0.026 0.955
$bca
[1] 0.049 0.059 0.892

> ci.norm(sample.size = 20, n.samples = 1000, boot.reps = 1000)
$t
[1] 0.030 0.024 0.946
$bca
[1] 0.035 0.037 0.928

> ci.t(sample.size = 10, n.samples = 1000, df = 3, boot.reps = 1000)
$t
[1] 0.018 0.022 0.960
$bca
[1] 0.055 0.076 0.869
Warning message:
In norm.inter(t, adj.alpha) : extreme order statistics used as endpoints

> ci.t(sample.size = 20, n.samples = 1000, df = 3, boot.reps = 1000)
$t
[1] 0.027 0.014 0.959
$bca
[1] 0.054 0.047 0.899

I don't understand the warning message, but for these examples the ordinary t interval appears to be better than the bootstrap BCa interval. I would really appreciate any recommendations anyone can give on when bootstrap confidence intervals should be used.

Thanks,
Mark
Re: [R] Function to compute the multinomial beta function?
At 05:10 PM 7/5/2010, Gregory Gentlemen wrote: Dear R-users, Is there an R function to compute the multinomial beta function? That is, the normalizing constant that arises in a Dirichlet distribution. For example, with three parameters the beta function is Beta(n1,n2,n3) = Gamma(n1)*Gamma(n2)*Gamma(n3)/Gamma(n1+n2+n3)

beta3 <- function (n1, n2, n3) exp(lgamma(n1)+lgamma(n2)+lgamma(n3)-lgamma(n1+n2+n3))
beta3(5,3,8)
[1] 1.850002e-07

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
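The same trick generalizes to any number of Dirichlet parameters; a hedged one-liner (the name lmbeta is made up here) that works on the log scale and reproduces the value above:

lmbeta <- function(a) sum(lgamma(a)) - lgamma(sum(a))   # log of the multinomial Beta function
exp(lmbeta(c(5, 3, 8)))                                  # 1.850002e-07, agreeing with beta3(5,3,8)

Keeping the result on the log scale (lmbeta itself) avoids underflow or overflow when the parameters are large.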
Re: [R] Comparing distributions
Your * curve apparently dominates your + curve. If they have the same total number of data each, as you say, they both cannot sum to the same value (e.g., N = 1 or 1.000). So there is something going on that you aren't mentioning. Try comparing CDFs instead of pdfs. At 03:33 PM 6/23/2010, Ralf B wrote: I am trying to do something in R and would appreciate a push in the right direction. I hope some of you experts can help. I have two distributions obtained from 1 datapoints each (about 1 datapoints each, non-normal with multi-modal shape (when eye-balling densities) but other than that I know little about its distribution). When plotting the two distributions together I can see that the two densities are alike with a certain distance to each other (e.g. 50 units on the X axis). I tried to plot a simplified picture of the density plot below: [simplified ASCII sketch omitted: two similar density curves, '*' and '+', with the '+' curve offset horizontally from the '*' curve] What I would like to do is to formally test their similarity or otherwise measure it more reliably than just showing and discussing a plot. Is there a general approach other than using a Mann-Whitney test, which is very strict and seems to assume a perfect match? Is there a test that takes in a certain 'band' (e.g. 50, 100, 150 units on X), or are there any other similarity measures that could give me a statistic about how close these two distributions are to each other? All I can say from eye-balling is that they seem to follow each other and it appears that one distribution is shifted by an amount from the other. Any ideas? Ralf __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
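A hedged sketch of the CDF comparison suggested above, assuming the two samples sit in numeric vectors a and b (names hypothetical):

plot(ecdf(a), main = "Empirical CDFs")
plot(ecdf(b), add = TRUE, col = "red")   # overlay; a roughly constant horizontal offset suggests a pure shift
ks.test(a, b)                            # Kolmogorov-Smirnov distance and test
wilcox.test(a, b, conf.int = TRUE)       # Hodges-Lehmann estimate of the shift, with a confidence interval

With large samples almost any difference will be flagged as significant, so the estimated shift and its confidence interval are usually more informative than the p-value itself.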
[R] Singularity in simple ANCOVA problem
I'm using R 2.10.1 with the latest version of all packages (updated today). I'm confused as to why I'm getting a hard singularity in a simple set of experimental data:

blots
   ID Lot Age Conc
1   1   A   3 4.44
2   2   A   3 4.56
3   3   B  41 4.03
4   4   B  41 4.57
5   5   C 229 4.49
6   6   C 229 4.66
7   7   D 238 3.88
8   8   D 238 3.93
9   9   E 349 4.43
10 10   E 349 4.22
11 11   F 391 4.42
12 12   F 391 4.46

fit2 <- lm(Conc ~ Age + Lot, data=blots)
summary(fit2)

Call:
lm(formula = Conc ~ Age + Lot, data = blots)

Residuals:
       Min         1Q     Median         3Q        Max
-2.700e-01 -6.625e-02  6.072e-17  6.625e-02  2.700e-01

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.5004639  0.1273234  35.347 3.42e-08 ***
Age         -0.0001546  0.0004605  -0.336   0.7485
LotB        -0.1941237  0.1706005  -1.138   0.2986
LotC         0.1099485  0.1554378   0.707   0.5059
LotD        -0.5586598  0.1558853  -3.584   0.0116 *
LotE        -0.1214948  0.1698331  -0.715   0.5013
LotF                NA         NA      NA       NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1787 on 6 degrees of freedom
Multiple R-squared: 0.7464, Adjusted R-squared: 0.535
F-statistic: 3.532 on 5 and 6 DF, p-value: 0.07811

Why the NA's here? Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Singularity in simple ANCOVA problem
At 08:44 PM 6/20/2010, RICHARD M. HEIBERGER wrote: You have only one Age value in each Lot. Hence the Age is aliased with one of the 5 degrees of freedom between lots. You can see it in the picture xyplot(Conc ~ Age|Lot, data=blots, pch=16) There is no opportunity for a slope within each of the groups. Rich Doh! Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
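A quick way to see the aliasing Rich describes, as a hedged sketch using the blots data above: Age takes exactly one value within each Lot, so the Age column of the design matrix is a linear combination of the intercept and the Lot dummies, and lm() must drop one coefficient.

with(blots, tapply(Age, Lot, unique))              # a single Age value per Lot
qr(model.matrix(~ Age + Lot, data = blots))$rank   # rank 6 for a 7-column design matrix: one column is redundant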
Re: [R] Popularity of R, SAS, SPSS, Stata...
At 09:01 PM 6/20/2010, Ted Harding wrote: On 20-Jun-10 19:49:43, Hadley Wickham wrote: Whales are a different kettle of fish! They are much more directly observable, in principle, than are R-users. For one thing, a whale has to come to the surface to breathe every so often, and if you are in a ship nearby you can see it happen. <snip> One thing both whales and R users have in common is that, when you sight one, you say "R, Matey! Thar she blows!" Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] logistic regression with 50 varaibales
degli Studi di Trieste Via Alfonso Valerio 6/a I-34127 Trieste phone: +39 0 40 5 58-37 68 email: cbelei...@units.it __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] glm poisson function
prop.test(c(271,433), c(6164113,5572041))

        2-sample test for equality of proportions with continuity correction

data: c(271, 433) out of c(6164113, 5572041)
X-squared = 54.999, df = 1, p-value = 1.206e-13
alternative hypothesis: two.sided
95 percent confidence interval:
 -4.291428e-05 -2.457623e-05
sample estimates:
      prop 1       prop 2
4.396415e-05 7.770941e-05

At 06:36 AM 6/10/2010, Phender79 wrote: Hi, I'm totally new to R so I apologise for the basic request. I am looking at the incidence of a disease over two time periods 1990-1995 and 2003-2008. I have counts for each year, subdivided into three disease categories and by males/females. I understand that I need to analyse the data using poisson regression and have managed to use the pois.daly function to get age-sex adjusted rates and corresponding confidence intervals. However, I now want to know how to get a p value (I'm writing up a paper for a journal) to say that x number of cases in the first cohort (1990-1995) is significantly lower than y number in the later cohort (2003-2008). I also want to make sure that I've corrected for overdispersion etc. I'm totally stuck and can't think where to start with writing a script. So basically my question is: e.g. I have 271 cases of the disease between 1990-1995 (total population at risk over six years = 6,164,113) and 433 cases between 2003-2008 (total population at risk over six years = 5,572,041) - is this significant and what is the P value. Any help much appreciated! Cheers P -- View this message in context: http://r.789695.n4.nabble.com/glm-poisson-function-tp2250210p2250210.html Sent from the R help mailing list archive at Nabble.com. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
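Since the original question asked about Poisson regression, here is a hedged sketch of that route with the two aggregated counts (object names are made up); the population at risk enters as an offset. With only two counts and two parameters the model is saturated, so overdispersion cannot be assessed from these data; that would need the year-by-year (or age-by-sex) counts.

cases  <- c(271, 433)
pop    <- c(6164113, 5572041)
period <- factor(c("1990-1995", "2003-2008"))
fit <- glm(cases ~ period + offset(log(pop)), family = poisson)
summary(fit)        # the Wald test on the period coefficient tests equality of the two incidence rates
exp(coef(fit)[2])   # estimated rate ratio, 2003-2008 vs 1990-1995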
Re: [R] When normality test violate and sample size is large(n=400) can I use t-test
At 11:45 AM 6/1/2010, Kourosh Ks wrote: Dears, when the normality test is violated and the sample size is large (n=400), can I use a t-test? best regards, kourosh

Generally yes, unless there is something really pathological about the distribution. You should note that, for n = 400, even the simplest distribution-free tests should have near-perfect power, so which test you use is not important. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
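A small simulation sketch (an aside, not from the thread) of the "generally yes": checking whether the one-sample t-test roughly holds its nominal level at n = 400 when the data are strongly skewed (standard lognormal, true mean exp(0.5)):

set.seed(1)
pvals <- replicate(5000, t.test(rlnorm(400), mu = exp(0.5))$p.value)
mean(pvals < 0.05)   # close to the nominal 0.05 despite the marked skewness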
Re: [R] Histogram Bin
x <- rnorm(200)
hist(x, 18)
str(hist(x, 18))
List of 7
 $ breaks     : num [1:15] -3 -2.5 -2 -1.5 -1 -0.5 0 0.5 1 1.5 ...
 $ counts     : int [1:14] 3 1 8 12 34 35 40 30 18 11 ...
 $ intensities: num [1:14] 0.03 0.01 0.08 0.12 0.34 ...
 $ density    : num [1:14] 0.03 0.01 0.08 0.12 0.34 ...
 $ mids       : num [1:14] -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 ...
 $ xname      : chr "x"
 $ equidist   : logi TRUE
 - attr(*, "class")= chr "histogram"
hist(x, 18, plot=FALSE)$breaks
 [1] -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0

At 09:55 AM 5/14/2010, Research wrote: Hello, Is there a function that returns the number of the bin (or quantile, or percentile etc. etc.) that a value of a variable may belong to? For example: breaks <- hist(variable, 18, plot=FALSE) If the following breaks 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 are the boundaries of successive bins of a histogram, then value 6 belongs to the 2nd bin. Best regards, Costas __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
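For the bin-number question itself, base R has this built in; a hedged sketch in the notation of the original post (variable is whatever vector was histogrammed):

br <- hist(variable, 18, plot = FALSE)$breaks
findInterval(variable, br)                                 # index of the bin each value falls in
cut(variable, br, labels = FALSE, include.lowest = TRUE)   # same idea via cut()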
Re: [R] Cubic B-spline, how to numerically integrate?
It's a piecewise cubic function. All you need are the knots and the coefficients. You can then write down the analytic formula for the integral. It doesn't need to be numerically integrated. At 11:57 AM 5/14/2010, David Winsemius wrote: On May 14, 2010, at 11:41 AM, Claudia Penaloza wrote: (corrected version of previous posting) I fit a GAM to turtle growth data following methods in Limpus & Chaloupka 1997 (http://www.int-res.com/articles/meps/149/m149p023.pdf). I want to obtain figures similar to Fig 3 c & f in Limpus & Chaloupka (1997), which according to the figure legend are expected size-at-age functions obtained by numerically integrating size-specific growth rate functions derived using a cubic B-spline fit to GAM predicted values. I was able to fit the cubic B-spline, but I do not know how to numerically integrate it. You need to give us the function and the appropriate limits of integration. Can anybody help please? Code and figures here: https://docs.google.com/fileview?id=0B1cQ7z9xYFl2ZTZhMmMyMjAtYTA3Zi00N2QyLTkxNzMtOGYyMjdiOGU2ZWE4hl=en Thank you, Claudia Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
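If one does want R to do the integration anyway, a hedged sketch: build an interpolating cubic spline of the GAM-predicted growth rates and integrate it numerically (age.grid, predicted.rate, a0 and size0 are hypothetical placeholders for the ages, predicted growth rates, starting age and starting size):

growth.rate <- splinefun(age.grid, predicted.rate, method = "natural")   # piecewise cubic through the predictions
size.at.age <- function(a, a0, size0) size0 + integrate(growth.rate, lower = a0, upper = a)$value
# e.g. sapply(seq(a0, max(age.grid), length = 50), size.at.age, a0 = a0, size0 = size0)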
Re: [R] P values
At 07:10 AM 5/7/2010, Duncan Murdoch wrote: Robert A LaBudde wrote: At 01:40 PM 5/6/2010, Joris Meys wrote: On Thu, May 6, 2010 at 6:09 PM, Greg Snow greg.s...@imail.org wrote: Because if you use the sample standard deviation then it is a t test not a z test. I'm doubting that seriously... You calculate normalized Z-values by substracting the sample mean and dividing by the sample sd. So Thomas is correct. It becomes a Z-test since you compare these normalized Z-values with the Z distribution, instead of the (more appropriate) T-distribution. The T-distribution is essentially a Z-distribution that is corrected for the finite sample size. In Asymptopia, the Z and T distribution are identical. And it is only in Utopia that any P-value less than 0.01 actually corresponds to reality. I'm not sure what you mean by this. P-values are simply statistics calculated from the data; why wouldn't they be real if they are small? Do you truly believe an actual real-life distribution accurately is fit by a normal distribution at quantiles of 0.001, 0.0001 or beyond? The map is not the territory, and just because you can calculate something from a model doesn't mean it's true. The real world is composed of mixture distributions, not pure ones. The P-value may be real, but its reality is subordinate to the distributional assumption involved, which always fails at some level. I'm simply asserting that level is in the tails at probabilities of 0.01 or less. Statisticians, even eminent ones such as yourself and lesser lights such as myself, frequently fail to keep this in mind. We accept such assumptions as normality, equal variances, etc., on an eyeballometric basis, without any quantitative understanding of what this means about limitations on inference, including P-values. Inference in statistics is much cruder and more judgmental than we like to portray. We should at least be honest among ourselves about the degree to which our hand-waving assumptions work. I remember at the O. J. Simpson trial, the DNA expert asserted that a match would occur only once in 7 billion people. I wondered at the time how you could evaluate such an assertion, given there were less than 7 billion people on earth at the time. When I was at a conference on optical disk memories when they were being developed, I heard a talk about validating disk specifications against production. One statement was that the company would also validate the undetectable error rate specification of 1 in 10^16 bits. I amusingly asked how they planned to validate the undetectable error rate. The response was handwaving and Just as we do everything else. The audience laughed, and the speaker didn't seem to know what the joke was. In both these cases the values were calculable, but that didn't mean that they applied to reality. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] P values
At 01:40 PM 5/6/2010, Joris Meys wrote: On Thu, May 6, 2010 at 6:09 PM, Greg Snow greg.s...@imail.org wrote: Because if you use the sample standard deviation then it is a t test not a z test. I'm doubting that seriously... You calculate normalized Z-values by substracting the sample mean and dividing by the sample sd. So Thomas is correct. It becomes a Z-test since you compare these normalized Z-values with the Z distribution, instead of the (more appropriate) T-distribution. The T-distribution is essentially a Z-distribution that is corrected for the finite sample size. In Asymptopia, the Z and T distribution are identical. And it is only in Utopia that any P-value less than 0.01 actually corresponds to reality. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Random numbers with PDF of user-defined function
At 05:40 AM 4/29/2010, Nick Crosbie wrote: Hi, In S+/R, is there an easy way to generate random numbers with a probability distribution specified by an exact user-defined function? For example, I have a function: f(x) = 1/(365 * x), which should be fitted for values of x between 1 and 100,000 How do I generate random numbers with a probability distribution that exactly maps the above function? Nick

First of all, your pdf should be f(x) = 1 / [x ln(10^5)], if x is continuous, so that it integrates to 1 over [1, 100000]. Second, compute the cdf as F(x) = ln(x) / ln(10^5). Third, compute the inverse cdf as G(p) = exp[p ln(10^5)]. Finally, to generate random variates, use G(u), where u is a uniform random variate in [0,1]. In R,

G <- function (p) exp(p*log(10^5))
G(runif(5))
[1] 11178.779736  9475.748549 65939.487801 94914.354479     1.694695

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
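A quick check of the generator (an aside, with G as defined above): if X = G(U) then log10(X) = 5U, so the base-10 logs of the draws should look uniform on [0, 5]:

set.seed(1)
x <- G(runif(1e5))
hist(log10(x), breaks = 50, freq = FALSE)   # roughly flat at 1/5 = 0.2 over [0, 5]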
Re: [R] From THE R BOOK - Warning: In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!
At 12:08 PM 3/30/2010, David Winsemius wrote: snip I don't understand this perspective. You bought Crowley's book so he is in some minor sense in debt to you. Why should you think it is more appropriate to send your message out to thousands of readers of r- help around the world (some of whom have written books that you did not buy) before sending Crowley a question about his text? In fairness to Michael Crawley, whose books are useful and very clear (although not well-liked on this list for some reason): 1. The example quoted by Corrado Topi is not an actual example. Instead is an isolated line of code given to illustrate the simplicity of glm() syntax and its relation to lm() syntax. This is in a short general topic overview chapter on GLMs meant to introduce concepts and terminology, not runnable code. 2. The example chapter is followed in the book by individual chapters on each type of GLM covered (count data, count data in tables, proportion data, binary response variables). If Corrado Topi had looked in the relevant chapter, he would find numerous worked out examples with runnable code. Corrado Topi made an error in trying to run an isolated line of code without antecedent definitions, which almost never works in any programming system. Michael Crawley made a mistake in judgment in assuming that detail later will suffice for generality now. My advice to Corrado Topi is engage in some forward referencing, and read chapters 16 and 17 before deciding which example code to run. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Nonparametric generalization of ANOVA
A search on bluesky...@gmail.com shows the user is in Norfolk, VA, USA. At 01:26 PM 3/5/2010, John Sorkin wrote: The sad part of this interchanges is that Blue Sky does not seem to be amiable to suggestion. He, or she, has not taken note, or responded to the fact that a number of people believe it is good manners to give a real name and affiliation. My mother taught me that when two people tell you that you are drunk you should lie down until the inebriation goes away. Blue Sky, several people have noted that you would do well to give us your name and affiliation. Is this too much to ask given that people are good enough to help you? John John David Sorkin M.D., Ph.D. Chief, Biostatistics and Informatics University of Maryland School of Medicine Division of Gerontology Baltimore VA Medical Center 10 North Greene Street GRECC (BT/18/GR) Baltimore, MD 21201-1524 (Phone) 410-605-7119 (Fax) 410-605-7913 (Please call phone number above prior to faxing) Matthew Dowle mdo...@mdowle.plus.com 3/5/2010 12:58 PM Frank, I respect your views but I agree with Gabor. The posting guide does not support your views. It is not any of our views that are important but we are following the posting guide. It covers affiliation. It says only that some consider it good manners to include a concise signature specifying affiliation. It does not agree that it is bad manners not to. It is therefore going too far to urge R-gurus, whoever they might be, to ignore such postings on that basis alone. It is up to responders (I think that is the better word which is the one used by the posting guide) whether they reply. Missing affiliation is ok by the posting guide. Users shouldn't be put off from posting because of that alone. Sending from an anonymous email address such as BioStudent is also fine by the posting guide as far as my eyes read it. It says only that the email address should work. I would also answer such anonymous posts, providing they demonstrate they made best efforts to follow the posting guide, as usual for all requests for help. Its so easy to send from a false, but apparently real name, why worry about that? If you disagree with the posting guide then you could make a suggestion to get the posting guide changed with respect to these points. But, currently, good and practice is defined by the posting guide, and I can't see that your view is backed up by it. In fact it seems to me that these points were carefully considered, and the wording is careful on these points. As far as I know you are wrong that there is no moderator. There are in fact an uncountable number of people who are empowered to moderate i.e. all of us. In other words its up to the responders to moderate. The posting guide is our guide. As a last resort we can alert the list administrator (which I believe is the correct name for him in that role), who has powers to remove an email address from the list if he thinks that is appropriate, or act otherwise, or not at all. It is actually up to responders (i.e. all of us) to ensure the posting guide is followed. My view is that the problems started with some responders on some occasions. They sometimes forgot, a little bit, to encourage and remind posters to follow the posting guide when it was not followed. This then may have encouraged more posters to think it was ok not to follow the posting guide. That is my own personal view, not a statistical one backed up by any evidence. Matthew Frank E Harrell Jr f.harr...@vanderbilt.edu wrote in message news:4b913880.9020...@vanderbilt.edu... 
Gabor Grothendieck wrote: I am happy to answer posts to r-help regardless of the name and email address of the poster but would draw the line at someone excessively posting without a reasonable effort to find the answer first or using it for homework since such requests could flood the list making it useless for everyone. Gabor I respectfully disagree. It is bad practice to allow anonymous postings. We need to see real names and real affiliations. r-help is starting to border on uselessness because of the age old problem of the same question being asked every two days, a high frequency of specialty questions, and answers given with the best of intentions in incremental or contradictory e-mail pieces (as opposed to a cumulative wiki or hierarchically designed discussion web forum), as there is no moderator for the list. We don't need even more traffic from anonymous postings. Frank On Fri, Mar 5, 2010 at 10:55 AM, Ravi Varadhan rvarad...@jhmi.edu wrote: David, I agree with your sentiments. I also think that it is bad posting etiquette not to sign one's genuine name and affiliation when asking for help, which blue sky seems to do a lot. Bert Gunter has already raised this issue, and I completely agree with him. I would also like to urge the R-gurus to ignore such postings. Best, Ravi.
Re: [R] Help with simple bootstrap test
The boot() function in the 'boot' package expects to find a function for the statistic with two arguments: the data object plus a row index object. You don't indicate enough to see how you will be resampling. If you sum all elements in your table, resampling would have to be one of: 1. A sample with replacement of rows, or 2. A sample with replacement of columns, or 3. A sample with replacement of elements in the whole table. Assuming you want sampling with replacement of rows:

sumf <- function (x, i) { sum(x[i,]) }
result <- boot(mytable, sumf, 1000)
boot.ci(result, type='bca')

A simulation:

mytable <- matrix(rnorm(2000), ncol=20)
sum(mytable)
[1] -14.92842
sumf <- function(x,i) sum(x[i,])
require('boot')
Loading required package: boot
b1 <- boot(mytable, sumf, 1000)
boot.ci(b1, type='bca')
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = b1, type = "bca")
Intervals :
Level      BCa
95%   (-101.49,  85.20 )
Calculations and Intervals on Original Scale

At 01:28 PM 2/25/2010, xonix wrote: Hi all Forgive me, I'm a total R newbie, and this seems to be a straightforward simple bootstrap problem, but after a whole day of trying to figure out how to do it I'm ready to give up. Part of the problem is that every example and every help page seems to be about doing something far more complex. I've got a table with 40 columns and 750 rows. I sum all the values across the whole table (and subsets of the columns). I want to bootstrap to get the 95% confidence intervals for that sum value. result <- boot(table, function, 1000) boot.ci(result, "bca") It seems to me that the 'function' is something to sum all the columns and rows of the table (or a subset should I desire). I've tried writing 'sum' for the function, but this gives me a huge figure which can't possibly be right. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] multiple-test correlation
At 07:35 AM 2/14/2010, Manuel Jesús López Rodríguez wrote: Dear all, I am trying to study the correlation between one independent variable (V1) and several others dependent among them (V2, V3, V4 and V5). For doing so, I would like to analyze my data by multiple-test (applying the Bonferroni correction or other similar), but I do not find the proper command in R. What I want to do is to calculate Kendall's correlation between V1 and the other variables (i.e. V1 vs V2, V1 vs V3, etc.) and to correct the p values by Bonferroni or other. I have found outlier.test, but I do not know if this is what I need (also, I would prefer to use a less conservative method than Bonferroni's, if possible). Thank you very much in advance!

One approach might be to first test for any correlations via a likelihood ratio test:

Ho: P = I (no correlations, i.e., the covariance matrix is diagonal)
T = -a ln V ~ chi-square[p(p-1)/2]
where V = det(R), a = N - 1 - (2p + 5)/6, N = # data, p = # variables.
Reject Ho if T > X^2(alpha, p(p-1)/2).

Then do the pairwise tests without familywise error control. I.e., this is similar to doing the F test in ANOVA before doing LSD testing. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
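A minimal sketch of that likelihood ratio test (a Bartlett-type test that the correlation matrix is the identity), assuming the variables sit in the columns of a data frame or matrix dat (name hypothetical); note it is built on the Pearson correlation matrix, while the follow-up pairwise tests can use cor.test(..., method = "kendall"):

lrt.cor <- function(dat) {
  N <- nrow(dat); p <- ncol(dat)
  R <- cor(dat)                          # Pearson correlation matrix
  a <- N - 1 - (2 * p + 5) / 6
  T <- -a * log(det(R))
  df <- p * (p - 1) / 2
  c(statistic = T, df = df, p.value = pchisq(T, df, lower.tail = FALSE))
}
# lrt.cor(dat)  # a small p-value says some correlation is nonzero; then run the pairwise cor.test()s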
Re: [R] random effects in mixed model not that 'random'
acting as grouping variable for a random effect intercept. If I have no knowledge about the schools, the random effect assumption makes sense. If I however investigate the schools in detail (either a priori or a posterior), say teaching quality of the teachers, socio-economic status of the school area etc, it will probably make sense to predict which ones will have pupils performing above average, and which below average. However then probably these factors leading me to the predictions should enter the model as fixed effects, and maybe I don't need and school random effect any more at all. But this means actually the school deviation from the global mean is not the realization of a random variable, but instead the result of something quite deterministic, but which is usually just unknown, or can only be measured with extreme, impractical efforts. So the process might not be random, just because so little is known about the process, the results appear as if they would be randomly drawn (from a larger population distribution). Again, is ignorance / lack of deeper knowledge the key to using random effects - and the more knowledge I have, the less ? many thanks, Thomas __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Implementation of gamma function for large number
At 04:53 PM 10/24/2009, R_help Help wrote: Hi - I ran into a problem when the argument to the gamma function is large. Say, if I do the following: gamma(17000) [1] Inf Warning message: value out of range in 'gammafn' Is there any way to get around this or another implementation? Thank you. -rc

Try the log of the gamma function instead:

?gamma
lgamma(17000)
[1] 148592.5
(17000-0.5)*log(17000)-17000+0.5*log(2*pi)
[1] 148592.5

Note that, for this size number, the first few terms of the usual expansions give the answer to 7 significant figures. I will restrain myself from asking the obvious question of what possible use gamma(17000) could be to you, and why a simple infinity would not work just as well. I'm sure you must have good reason for your request. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
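A hedged aside on how such log-scale values usually get used downstream: ratios and products of astronomically large gamma values are often perfectly ordinary numbers, and they can be formed safely by differencing lgamma() terms before exponentiating.

exp(lgamma(17000) - lgamma(16999))   # Gamma(17000)/Gamma(16999) = 16999, with no overflow
lchoose(17000, 8500)                 # log of an enormous binomial coefficient, same idea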
[R] Confusion on use of permTS() in 'perm'
Consider the following simple example using R-2.9.0 and 'perm' 2.9.1:

require('perm')
p <- c(15,21,26,32,39,45,52,60,70,82)
g <- c('y','n','y','y', rep('n',6))   # Patients ranked 1,3,4 receive treatment
permTS(p ~ g, alternative = 'two.sided', method='exact.ce')   # find p-value by complete enumeration

        Exact Permutation Test (complete enumeration)
data: p by g
p-value = 0.05
alternative hypothesis: true mean of g=n minus mean of g=y is not equal to 0
sample estimates:
mean of g=n minus mean of g=y
                     28.38095

The permutation observed is '134', which has a rank sum of 8. Other permutations with rank sums of 8 or less are '123', '124' and '125'. So there are a total of 4 out of 4! = 120 possible, or a one-tail p-value of 4/120 = 0.0333, or a 2-tail p-value of 2*4/120 = 0.067. This is not, however, what permTS() returns. The permTS() value of 0.05 appears to correspond to 3 patterns, not 4. Am I misunderstanding how to solve this simple problem, or is something going on with permTS() that I'm missing? Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
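One way to chase the discrepancy (an aside, not a reply from the thread) is to enumerate the statistic permTS() actually reports, the difference in group means, rather than rank sums; the sketch below assumes that is the statistic being permuted, as the printed output suggests. On the mean-difference scale the assignment '125' is slightly less extreme than the observed '134', so only 3 of the 120 assignments fall in the observed tail, and doubling that tail reproduces the reported 0.05.

p <- c(15, 21, 26, 32, 39, 45, 52, 60, 70, 82)
treat <- combn(10, 3)                                  # all 120 choices of the 3 treated ranks
diffs <- apply(treat, 2, function(ix) mean(p[-ix]) - mean(p[ix]))
obs <- mean(p[-c(1, 3, 4)]) - mean(p[c(1, 3, 4)])      # 28.38095, matching the permTS() output
sum(diffs >= obs)                                      # 3: assignments '123', '124', '134' only
2 * mean(diffs >= obs)                                 # 0.05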
Re: [R] Problem accessing functions in package 'roxygen'
Thanks, Uwe. Unfortunately, that doesn't work either:

library('roxygen')
Warning message:
package 'roxygen' was built under R version 2.9.1
roxygen::trim(' 1234 ')
Error: 'trim' is not an exported object from 'namespace:roxygen'

I ended up using

trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)

instead. At 01:42 PM 9/3/2009, Uwe Ligges wrote: Robert A. LaBudde wrote: I have Vista Home with R-2.9.0, and installed and tried to test the package 'roxygen': utils:::menuInstallPkgs() trying URL 'http://lib.stat.cmu.edu/R/CRAN/bin/windows/contrib/2.9/roxygen_0.1.zip' Content type 'application/zip' length 699474 bytes (683 Kb) opened URL downloaded 683 Kb package 'roxygen' successfully unpacked and MD5 sums checked The downloaded packages are in C:\Users\RAL\AppData\Local\Temp\RtmpZPlILq\downloaded_packages updating HTML package descriptions Warning message: In file.create(f.tg) : cannot create file 'C:\PROGRA~1\R\R-29~1.0/doc/html/packages.html', reason 'Permission denied' library('roxygen') Warning message: package 'roxygen' was built under R version 2.9.1 trim(" 1234 ") Error: could not find function "trim" I have a similar problem with trim.right(), trim.left() and other functions I've tried. Any ideas? Probably it is not intended to call trim and friends like that, because they are not exported from roxygen's namespace, hence you could use roxygen:::trim(" 1234 ") Uwe Ligges Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Problem accessing functions in package 'roxygen'
I have Vista Home with R-2.9.0, and installed and tried to test the package 'roxygen': utils:::menuInstallPkgs() trying URL 'http://lib.stat.cmu.edu/R/CRAN/bin/windows/contrib/2.9/roxygen_0.1.zip' Content type 'application/zip' length 699474 bytes (683 Kb) opened URL downloaded 683 Kb package 'roxygen' successfully unpacked and MD5 sums checked The downloaded packages are in C:\Users\RAL\AppData\Local\Temp\RtmpZPlILq\downloaded_packages updating HTML package descriptions Warning message: In file.create(f.tg) : cannot create file 'C:\PROGRA~1\R\R-29~1.0/doc/html/packages.html', reason 'Permission denied' library('roxygen') Warning message: package 'roxygen' was built under R version 2.9.1 trim( 1234) Error: could not find function trim I have a similar problem with trim.right(), trim.left() and other functions I've tried. Any ideas? Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] online classes or online eduction in statistics? esp. time series analysis and cointegration?
http://www.statistics.com/ for online courses without college credit, taught by well known faculty typically using their own books. At 01:50 AM 8/31/2009, Luna Laurent wrote: Hi all, I am looking for low cost online education in statistics. I am thinking of taking online classes on time series analysis and cointegration, etc. Of course, if there are free video lectures, that would be great. However I couldn't find any free video lectures at upper-undergraduate and graduate level which formally going through the whole timeseries education... That's why I would like to enroll in some sort of online degree classes. However, I don't want to earn the certificate or the degree; I just want to audit the online class specifically in time series analysis and cointegration. Could anybody recommend such online education in statistics esp. in time series and cointegration, at low cost? Hopefully it's not going to be like a few thousand dollars for one class. Thanks a lot for your pointers in advance! [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] transformation/link glm in R
1. It sounds as if you say left skewed when you mean right skewed and v.v. Right skewed distributions have a long tail to the right. 2. Your 3rd variable may be censored, as you indicate it is bounded at 8 hr and that is also the mode. If so, this will complicate your analysis. 3. Your variables would at first sight appear to be continuous, yet you are modeling them as counts. Why? Did you discretize the times? 4. I suggest you collect the standard transforms related to the different distributions, and do some EDA where you plot the distributions of the transformed data. How much does unimodal symmetry improve? This might point you towards the right family. Using the Box-Cox method to find a power transform may be useful in down-selecting. 5. You might consider the gamma distribution with a 1/x link for the first two variables. 6. A beta distribution could probably model all 3 of your variables. You have given insufficient information regarding your experiment and its response variables to allow accurate advice to be supplied. Where do the bounds come from? Are they arbitrary experimental cut-offs, causing censoring? If so, would you be better off using survival analysis (e.g., coxph() instead of glm)? Etc. At 06:52 AM 8/18/2009, Mcdonald, Grant wrote: Dear sir, I have 3 different time response variables that are in the form seconds. All three response variables are not normally distributed. They are in the form of mate latency; the first two responses are bounded to 30 mins and the third is bounded to 8 hours. Frequency plots of raw data show that the first two are heavily skewed to the left and the third (bounded at 8hrs) is heavily skewed to the right with most data points being 8 hours long. I am unsure of using the appropriate transformations in R and have only found appropriate transformations for count data (glm(time~x*x1*x2, family=poisson)) and proportion data (glm(time~x*x1*x2, family=binomial)). Would it be possible to point me in the right direction of the appropriate glm family to use for such data, or should I use some transformation separately and use anova in R, of which I am more confident of the code? Sorry if this is an inappropriate question but I would greatly appreciate advice, G. Colin __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
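Hedged sketches of two of the suggestions above, written in the poster's own notation (the data frame name dat is made up here):

# (5) Gamma GLM with the inverse link for a positive, right-skewed response time
fit.gamma <- glm(time ~ x * x1 * x2, family = Gamma(link = "inverse"), data = dat)
summary(fit.gamma)

# (4) Box-Cox search for a power transform before falling back on lm()/anova()
library(MASS)
bc <- boxcox(time ~ x * x1 * x2, data = dat)   # plots the profile log-likelihood against lambda
lambda <- bc$x[which.max(bc$y)]                # a rough choice of the power

Neither of these handles the possible censoring raised in point 2; if the 8-hour values are truly censored, the survival-analysis route mentioned at the end of the reply would be the more honest one.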
Re: [R] bootstrapped correlation confint lower than -1 ?
The basic interval (which you are using by default) is just the usual normal distribution interval with the standard error estimated from the resampling distribution. It has all the benefits and problems of assuming a normal distribution. If you wish to maintain the domain limits on the correlation, use the percentile method, which would use the 2.5% and 97.5% quantiles of the resampling distribution. Because you clearly have a non-central correlation (-0.8), you would get more accurate results with the BCa method, which does bias and skew correction of the resampling distribution. At 06:22 AM 8/16/2009, Liviu Andronic wrote: Dear R users, Do the results below make any sense? Can the interval of the correlation coefficient be between *-1.0185* and -0.8265 at the 95% confidence level? Liviu

library(boot)
data(mtcars)
with(mtcars, cor.test(mpg, wt, met="spearman"))

        Spearman's rank correlation rho
data: mpg and wt
S = 10292, p-value = 1.488e-11
alternative hypothesis: true rho is not equal to 0
sample estimates:
     rho
-0.88642
Warning message:
In cor.test.default(mpg, wt, met = "spearman") : Cannot compute exact p-values with ties

corr.fun <- function(data, indices) {
  data.samp <- data[indices,]    # allows boot to select sample
  tmp.corr <- with(data.samp, cor.test(mpg, wt, met="spearman"))
  return(tmp.corr$estimate)
}
corr.boot <- boot(mtcars, corr.fun, R=1000)
There were 50 or more warnings (use warnings() to see the first 50)
corr.boot
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = mtcars, statistic = corr.fun, R = 1000)
Bootstrap Statistics :
     original      bias    std. error
t1*  -0.88642    0.012992    0.050042

boot.ci(corr.boot, type="basic")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = corr.boot, type = "basic")
Intervals :
Level      Basic
95%   (-1.0185, -0.8265 )
Calculations and Intervals on Original Scale

-- Do you know how to read? http://www.alienetworks.com/srtest.cfm Do you know how to write? http://garbl.home.comcast.net/~garbl/stylemanual/e.htm#e-mail __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
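A hedged follow-up in code, using the corr.boot object from the post above: asking boot.ci() for the percentile and BCa intervals keeps the endpoints inside [-1, 1] (BCa wants a reasonably large number of replicates, so R = 1000 or more):

boot.ci(corr.boot, type = c("perc", "bca"))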
Re: [R] How to test frequency independence (in a 2 by 2 table) with many missing values
1. Drop all subjects where both test results are not present. These are uninformative on the difference. 2. Perform the McNemar test on the remaining table. (Only the discordant pairs are informative.) This will give you a p-value on the data that actually contrasts the two tests. (For your example, the p-value is about 0.044:)

X2 <- data.frame(x1.m, x2.m)
tX2 <- table(X2)
tX2
    x2.m
x1.m  0  1
   0  6  5
   1 15  4
mcnemar.test(tX2)
        McNemar's Chi-squared test with continuity correction
data: tX2
McNemar's chi-squared = 4.05, df = 1, p-value = 0.04417

The low power is a consequence of the data not being informative. You'll have to live with it. If you want to squeeze any more out of this data, I'd guess about the only way to do it would be via a maximum likelihood approach with marginals for the one-test-only data, an assumption of, e.g., a binomial distribution for each test, and then use a likelihood ratio test for p1=p2 vs. p1!=p2. If you really feel up to it, you could program a randomization or bootstrap test equivalent to the ML approach, maintaining the marginal totals involved. At 01:23 PM 7/24/2009, Tal Galili wrote: Hello dear R help group. My question is statistical and not R specific, yet I hope some of you might be willing to help. *Experiment settings*: We have a list of subjects. Each of them went through two tests, where the answer to each can be either 0 or 1. *Goal*: We want to know if the two experiments yielded different results in the subjects' answers. *Statistical test (and why it won't work)*: Naturally we would turn to performing a McNemar test. But here is the twist: we have missing values in our data. For our purpose, let's assume the missingness is completely at random, and we also have no data to use for imputation. Also, we have much more missing data for experiment number 2 than in experiment number 1. So the question is, under these settings, how do we test for an experiment effect on the outcome? So far I have thought of two options: 1) To perform the test only for subjects that have both values. But they are too scarce and will yield low power. 2) To treat the data as independent and do a Pearson's chi square test (well, an exact Fisher test that is) on all the non-missing data that we have. The problem with this is that our data is not fully independent (which is a prerequisite to the chi test, if I understood it correctly). So I am not sure if that is a valid procedure or not. Any insights will be warmly welcomed. p.s: here is an example code producing this scenario.

set.seed(102)
x1 <- rbinom(100, 1, .5)
x2 <- rbinom(100, 1, .3)
X <- data.frame(x1,x2)
tX <- table(X)
margin.table(tX,1)
margin.table(tX,2)
mcnemar.test(tX)
put.missings <- function(x.vector, na.percent) {
  turn.na <- rbinom(length(x.vector), 1, na.percent)
  x.vector[turn.na == 1] <- NA
  return(x.vector)
}
x1.m <- put.missings(x1, .3)
x2.m <- put.missings(x2, .6)
tX.m <- rbind(table(na.omit(x1.m)), table(na.omit(x2.m)))
fisher.test(tX.m)

With regards, Tal -- -- My contact information: Tal Galili Phone number: 972-50-3373767 FaceBook: Tal Galili My Blogs: http://www.r-statistics.com/ http://www.talgalili.com http://www.biostatistics.co.il [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl.
ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Getting identify() to work with lattice::cloud()
I don't seem to be able to put any syntax into identify() that gets it to work with a lattice cloud() graph:

layout(1)
require('lattice')
cloud(g3 ~ g1 + g2, data=gapp, col = "blue", xlab='G1 Score', ylab='G2 Score', zlab='G3 Score', main='3D Plot for Applicants')
identify(gapp[,2:4], labels=gapp$ID)

Here g1 is gapp[,2], g2 is gapp[,3] and g3 is gapp[,4]. gapp is a data frame. I've tried several variants in the first arguments of identify(), but no matter what I use, when I click on a point I get the warning "no point within 0.25 inches". Does identify() work with cloud()? If so, what is the correct syntax? Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd. URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239 Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Comparing model fits for NLME when models are not nested
At 05:42 AM 6/12/2009, Lindsay Banin wrote: Hi there, I am looking to compare nonlinear mixed effects models that have different nonlinear functions (different types of growth curve)embedded. Most of the literature I can find focuses on comparing nested models with likelihood ratios and AIC. Is there a way to compare model fits when models are not nested, i.e. when the nonlinear functions are not the same? Transform back into original units, if necessary, and compare distributions of and statistics of residuals from fitted values in original units. This is not a significance-test, but instead a measure of the better approximation to the observed model. Types of measures: 1) rms residual, 2) max absolute residual, 3) mean absolute residual. In my opinion, models should be chosen based on the principles of causality (theory), degree of approximation and parsimony. None of these involve significance testing. Choosing models based upon significance testing (which merely identifies whether or not the experiment is large enough to distinguish an effect clearly) amounts to admitting intellectually that you have no subject matter expertise, and you must therefore fall back on the crumbs of significance testing to get glimmers of understanding of what's going on. (Much like stepwise regression techniques.) As an example, suppose you have two models, one with 5 parameters and one with only 1. The rms residual error for the two models are 0.50 and 0.53 respectively. You have a very large study, and all 4 additional parameters are significant at p = 0.01 or less. What should you do? What I would do is select the 1 parameter study as my baseline model. It will be easy to interpret physically, will generalize to other studies much better (stable), and is almost identical in degree of approximation as the 5 parameter model. I would be excited that a one parameter model could do this. The fact that the other 4 parameters have detectable effects at a very low level is not important for modeling the study, but may conceivably have some special significance on their own for future investigations. So not being able to do significance testing on non-nested models is not that big a loss, in my opinion. Such tests encourage wrong thinking, in my opinion. What I've expressed as an opinion here (which I am sure some will disagree with) is similar to the philosophy of choosing the number of principal components to use, or number of latent factors in factor analysis. What investigation do people ever do on the small eigenvalue principal components, even if their contributions are statistically significant? Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
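A minimal sketch of the residual summaries suggested above, assuming fit1 and fit2 are the two fitted (non-nested) models and y is the response, all in original units (names hypothetical):

res.stats <- function(y, fitted) {
  r <- y - fitted
  c(rms = sqrt(mean(r^2)), mean.abs = mean(abs(r)), max.abs = max(abs(r)))
}
rbind(model1 = res.stats(y, fitted(fit1)),
      model2 = res.stats(y, fitted(fit2)))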
Re: [R] time series, longitudinal data or trajectories
At 04:54 AM 6/6/2009, Christophe Genolini wrote: Thanks for your answers. So if I understand: - Trajectories are continuous, the others are discrete. - The difference between time series and longitudinal is that time series are made at regular times whereas longitudinal are not? - Repeated measures are over a short period of time. So if I measure the weight of my patient daily during one week, it will be a repeated measure; if I measure it once a week during one year, it will be a time series; if I measure it once a week during one year but with some missing weeks, it will be longitudinal data? Well I guess it is not as simple as that, but is it the idea? snip Not exactly. If you measure weight daily for a week, that is a time series (equally spaced time measurements over an arbitrary period) and repeated measurements (multiple measurements on the same subject, whether on time or at random or in some other way). If you measure weight weekly for each week in a year this is a time series (equally spaced time measurements) and would generally be called a longitudinal study (measurements over a lengthy enough time period that time-related changes are expected). Missing data are common in repeated measures or longitudinal studies. In longitudinal studies, an additional problem of dropouts is present, which may be correlated with the unobserved measurement (i.e., missing not at random, or non-ignorable). Also, the long time period of a longitudinal study may create issues of measurement bias (due to drift in technique or clinicians over time) and change in the subject baseline state. Time series is typically used in my experience for measurements that have a great degree of regularity (equally-spaced times, few or no missing data). Trajectory is a term for a continuous time curve. Examples: Study to measure blood pressure measurement fluctuations: N subjects measured by M operators every 8 hr during a week. Note there is a general expectation of a constant mean value for each subject during the period, with probably short-time fluctuations. This would be called a repeated measures study. Although it could also be called a time series study, the expectation of no total time period effect and the possibility of missing measurements would argue against that term. On the other hand, if a posteriori there were little or no missing data, and regular time-dependent patterns were observed, its name might be shifted to a time series study. Cohort study to measure blood pressure changes over a ten year time frame for treated and untreated subjects: There will be significant amounts of missing data, dropouts from the study and a long time period of observation. This would almost universally be known as a longitudinal study. Controlled experiment to measure rates of gelation of batches of different gelatins: N lots of gelatin, each measured in solution at the same M time periods for viscosity. A continuous underlying viscosity vs. time curve is expected (the trajectory) for each lot. Time periods are equal, and there are few missing data. The goal is to compare gelation trajectories. This is a repeated measures study, and might more particularly be characterized as a time series study. When the subjects are living entities, usually the terms repeated measures or longitudinal are used. If measurements are taken at a single point in time, the term cross-sectional study is used. If there is a single response across time, the term time series is used. 
If there are multiple responses all measured at the same times for the subjects, the term panel data is used. For controlled experiments, the terms repeated measures and time series are common. Longitudinal could be used, but generally is not. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] time series, longitudinal data or trajectories
At 04:02 PM 6/5/2009, Christophe Genolini wrote: Hi the list Strictly speaking, this is not a R question, but I need the information for the creation of a package. My question is about vocabulary: What is the difference between time series, longitudinal data and trajectories? Sincerely Christophe Longitudinal data are measurements over long periods of time, often at irregular periods, but consistent across subjects. Repeated measures data are replicates at the same point in time, or over a short period of time (e.g., laboratory experiments). Time series typically have constant increments of time and typically a stochastic character, although this term might be considered all-encompassing for all measurements at different times. Trajectory implies a continuous curve in time, as opposed to discrete times. Trajectory also implies an underlying causal model, as it is a term from kinematics. I hope this helps. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Block factor as random or fixed effect?
At 05:49 PM 5/13/2009, Rob Knell wrote: People I apologise for asking a general stats question, but I'm at a bit of a loss as to what to do following some hostile referees' comments. If I have a fully randomised blocked design, with only three blocks, should I treat block as a random or fixed effect? I have read comments about not treating block as a random effect if the number of blocks is less than 6 or 7: is this right? Any advice much appreciated Rob Knell If you treat the variable as fixed effects, then inference will only apply to those particular choices of blocks. If you treat the variable as a random effect, you are probably going to estimate a variance for a population distribution plus a mean effect, so inference can be made to the population of all possible blocks. The rule you've probably seen quoted could be paraphrased to say: If you're trying to estimate a random effect (i.e., variance), you will need at least 6 subjects, or you won't get any precision on the estimate. For fewer than 6 subjects, you might as well give up on modeling a random effect, and just settle for doing the fixed effects model. That being said, if you really need inferences on the population of blocks, model the random effect and bite the bullet on the imprecision. Also, remember the assumption that the blocks are chosen randomly (from a normal distribution). If they're not, stick with the fixed effects model. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
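A minimal sketch of the two specifications discussed above, with hypothetical variable names ('y', 'treatment', 'block' with only 3 levels, data frame 'dat'):

# Block as a fixed effect: inference applies only to these particular blocks.
fit.fixed <- lm(y ~ treatment + block, data = dat)
# Block as a random effect: inference extends to the population of blocks,
# but the block variance will be poorly estimated with only 3 levels.
library(lme4)
fit.random <- lmer(y ~ treatment + (1 | block), data = dat)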
Re: [R] p-values from bootstrap - what am I not understanding?
There is really nothing wrong with this approach, which differs primarily from the permutation test in that sampling is with replacement instead of without replacement (multinomial vs. multiple hypergeometric). One of the issues that permutation tests don't have is bias in the statistic. In order for bootstrap p-values to be reasonably accurate, you need a reasonable dataset size, so that sampling with replacement isn't a big effect, and so that enough patterns arise in resampling. It also helps if the data are continuous instead of categorical or binary. The same issues affect permutation tests, although those are untroubled by bias. The usual methods for p-values (e.g., see Fisher's test in Agresti's Categorical Data Analysis) work here. Typically there is some ambiguity on how to treat the values equal to the observed statistic. If you include them, the p-value is conservative for rejection. If you don't, it's liberal for rejection. If you include them with 1/2 weight, it averages correctly in the long run. Ditto for 2-tailed p-values vs. single tails. Several different methods (some of which you listed) are used. As a general rule, if you have data from which you wish a p-value, a permutation (i.e., without replacement) test is used, but for confidence intervals, bootstrapping (i.e., with replacement) is used. For reasonably large datasets, both methods will agree closely. But permutation tests are typically used for smaller datasets. (Think binomial vs. hypergeometric distributions for p-values, and when they agree.) At 05:47 PM 4/12/2009, Johan Jackson wrote: Dear stats experts: Me and my little brain must be missing something regarding bootstrapping. I understand how to get a 95%CI and how to hypothesis test using bootstrapping (e.g., reject or not the null). However, I'd also like to get a p-value from it, and to me this seems simple, but it seems no-one does what I would like to do to get a p-value, which suggests I'm not understanding something. Rather, it seems that when people want a p-value using resampling methods, they immediately jump to permutation testing (e.g., destroying dependencies so as to create a null distribution). SO - here's my thought on getting a p-value by bootstrapping. Could someone tell me what is wrong with my approach? Thanks: STEPS TO GETTING P-VALUES FROM BOOTSTRAPPING - PROBABLY WRONG: 1) sample B times with replacement, figure out theta* (your statistic of interest). B is large (> 1000) 2) get the distribution of theta* 3) the mean of theta* is generally near your observed theta. In the same way that we use non-centrality parameters in other situations, move the distribution of theta* such that the distribution is centered around the value corresponding to your null hypothesis (e.g., make the distribution have a mean theta = 0) 4) Two methods for finding 2-tailed p-values (assuming here that your observed theta is above the null value): Method 1: find the percent of recentered theta*'s that are above your observed theta. p-value = 2 * this percent Method 2: find the percent of recentered theta*'s that are above the absolute value of your observed value. This is your p-value. So this seems simple. But I can't find people discussing this. So I'm thinking I'm wrong. Could someone explain where I've gone wrong? J Jackson [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
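As a sketch of the permutation-test route with the 1/2-weight tie convention mentioned above (x and y are hypothetical numeric samples; the statistic is the difference in means, and B Monte Carlo permutations are used rather than the full enumeration):

perm_pvalue <- function(x, y, B = 10000) {
  obs <- mean(x) - mean(y)                     # observed statistic
  z <- c(x, y)
  n <- length(x)
  stat <- replicate(B, {                       # statistic under random relabeling
    i <- sample(length(z), n)
    mean(z[i]) - mean(z[-i])
  })
  # two-sided p-value, counting exact ties with weight 1/2
  (sum(abs(stat) > abs(obs)) + 0.5 * sum(abs(stat) == abs(obs))) / B
}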
Re: [R] Physical Units in Calculations
At 08:00 PM 4/12/2009, Tom La Bone wrote: Back in 2005 I had been doing most of my work in Mathcad for over 10 years. For a number of reasons I decided to switch over to R. After much effort and help from you folks I am finally thinking in R rather than thinking in Mathcad and trying to translating to R. Anyway, the only task I still use Mathcad for is calculations that involve physical quantities and units. For example, in Mathcad I can add 1 kilometer to 1 mile and get the right answer in the units of length I choose. Likewise, if I try to add 1 kilometer to 1 kilogram I get properly chastised. Is there a way in R to assign quantities and units to numbers and have R keep track of them like Mathcad does? Yes, but it's a lot of work: Create objects with units as an attribute. Perhaps someone else can tell you if such a set of definitions already exists. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
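A rough, hand-rolled sketch of the "units as an attribute" idea (not an existing package API; only addition of identical units is handled here, so unit conversions as in Mathcad would additionally need a conversion table):

quantity <- function(x, unit) structure(x, unit = unit, class = "quantity")
"+.quantity" <- function(e1, e2) {
  if (!identical(attr(e1, "unit"), attr(e2, "unit")))
    stop("incompatible units: ", attr(e1, "unit"), " vs ", attr(e2, "unit"))
  quantity(unclass(e1) + unclass(e2), attr(e1, "unit"))
}
print.quantity <- function(x, ...) cat(unclass(x), attr(x, "unit"), "\n")
quantity(1, "km") + quantity(0.5, "km")   # 1.5 km
# quantity(1, "km") + quantity(1, "kg")   # would stop with an "incompatible units" error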
Re: [R] Genstat into R - Randomisation test
At 04:43 AM 4/9/2009, Tom Backer Johnsen wrote: Peter Dalgaard wrote: Mike Lawrence wrote: Looks like that code implements a non-exhaustive variant of the randomization test, sometimes called a permutation test. Isn't it the other way around? (Permutation tests can be exhaustive by looking at all permutations, if a randomization test did that, then it wouldn't be random.) Eugene Edgington wrote an early book (1980) on this subject called Randomization tests, published by Dekker. As far as I remember, he differentiates between Systematic permutation tests where one looks at all possible permutations. This is of course prohibitive for anything beyond trivially small samples. For larger samples he uses what he calls Random permutations, where a random sample of the possible permutations is used. Tom Peter Dalgaard wrote: Mike Lawrence wrote: Looks like that code implements a non-exhaustive variant of the randomization test, sometimes called a permutation test. Isn't it the other way around? (Permutation tests can be exhaustive by looking at all permutations, if a randomization test did that, then it wouldn't be random.) Edgington and Onghena make a similar distinction in their book, but I think such a distinction is without merit. Do we distinguish between exact definite integrals and approximate ones obtained by numerical integration, of which Monte Carlo sampling is just one class of algorithms? Don't we just say: The integral was evaluated numerically by the [whatever] method to an accuracy of [whatever], and the value was found to be [whatever]. Ditto for optimization problems. A randomization test has one correct answer based upon theory. We are simply trying to calculate that answer's value when it is difficult to do so. Any approximate method that is used must be performed such that the error of approximation is trivial with respect to the contemplated use. Doing Monte Carlo sampling to find an approximate answer to a randomization test, or to an optimization problem, or to a bootstrap distribution should be carried out with enough realizations to make sure the residual error is trivial. As Monte Carlo sampling is a random-sampling-based approximate method, the name does create confusion in terminology between randomization tests and bootstrapping. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
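A small sketch of the "enough realizations" point above: the Monte Carlo standard error of an estimated p-value is roughly sqrt(p*(1-p)/B), so B can be chosen to make that error trivial relative to the decision threshold (the numbers below are only illustrative):

p_target <- 0.05
B <- 10000
sqrt(p_target * (1 - p_target) / B)   # about 0.0022, small relative to 0.05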
Re: [R] course in ecological data analysis
At 05:27 AM 4/3/2009, Capelle, Jacob wrote: Dear all, For my PhD study I'm looking for relevant courses/workshops (short term) in ecological data anlysis with R in Europe. After 2 days searching I'm convinced that google is probably not the right medium to find this information. If anyone can help me I will be most grateful. Best regards - J. Capelle Try http://www.statistics.com/ Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] 'stretching' a binomial variable
At 06:49 AM 3/27/2009, imicola wrote: Hi, I'm carrying out some Bayesian analysis using a binomial response variable (proportion: 0 to 1), but most of my observations have a value of 0 and many have very small values (i.e. 0.001). I'm having trouble getting my MCMC algorithm to converge, so I have decided to try normalising my response variable to see if this helps. I want it to stay between 0 and 1 but to have a larger range of values, or just for them all to be slightly higher. Does anyone know the best way to achieve this? I could just add a value to each observation (say 10 to increase the proportion a bit, but ensuring it would still be between 0 and 1) - would that be ok? Or is there a better way to stretch the values up? Sorry - I know it's not really an R specific question, but I have never found a forum with as many stats literate people as this one :-) Cheers - any advice much appreciated! nicola Work with events instead of proportions, and use a Poisson model. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
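A minimal sketch of that suggestion (variable names are hypothetical): model the event counts directly with a Poisson GLM and an offset for the exposure, instead of modeling the proportion.

fit <- glm(events ~ x1 + x2 + offset(log(n_trials)),
           family = poisson, data = dat)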
Re: [R] Logistic regression problem
It would not be possible to answer your original question until you specify your goal. Is it to develop a model with external validity that will generalize to new data? (You are not likely to succeed, if you are starting with a boil the ocean approach with 44,000+ covariates and millions of records.) This is the point Prof. Harrell is making. Or is it to reduce a large dataset to a tractable predictor formula that only interpolates your dataset? If the former, you will need external modeling information to select the wheat from the chaff in your excessive predictor set. Assuming it is the latter, then almost any approach that ends up with a tractable model (that has no meaning other than interpolation of this specific dataset) will be useful. For this, regression trees or even stepwise regression would work. The algorithm must be very simple and computer efficient. This is the area of data mining approaches. I would suggest you start by looking at covariate patterns to find out where the scarcity lies. These will end up high leverage data. Another place to start is common sense: Thousands of covariates cannot all contain independent information of value. Try to cluster them and pick the best representative from each cluster based on expert knowledge. You may solve your problem quickly that way. At 05:34 AM 10/1/2008, Bernardo Rangel Tura wrote: Em Ter, 2008-09-30 às 18:56 -0500, Frank E Harrell Jr escreveu: Bernardo Rangel Tura wrote: Em Sáb, 2008-09-27 às 10:51 -0700, milicic.marko escreveu: I have a huge data set with thousands of variable and one binary variable. I know that most of the variables are correlated and are not good predictors... but... It is very hard to start modeling with such a huge dataset. What would be your suggestion. How to make a first cut... how to eliminate most of the variables but not to ignore potential interactions... for example, maybe variable A is not good predictor and variable B is not good predictor either, but maybe A and B together are good predictor... Any suggestion is welcomed milicic.marko I think do you start with a rpart(binary variable~.) This show you a set of variables to start a model and the start set to curoff for continous variables I cannot imagine a worse way to formulate a regression model. Reasons include 1. Results of recursive partitioning are not trustworthy unless the sample size exceeds 50,000 or the signal to noise ratio is extremely high. 2. The type I error of tests from the final regression model will be extraordinarily inflated. 3. False interactions will appear in the model. 4. The cutoffs so chosen will not replicate and in effect assume that covariate effects are discontinuous and piecewise flat. The use of cutoffs results in a huge loss of information and power and makes the analysis arbitrary and impossible to interpret (e.g., a high covariate value:low covariate value odds ratio or mean difference is a complex function of all the covariate values in the sample). 5. The model will not validate in new data. Professor Frank, Thank you for your explain. Well, if my first idea is wrong what is your opinion on the following approach? 1- Make PCA with data excluding the binary variable 2- Put de principal components in logistic model 3- After revert principal componentes in variable (only if is interesting for milicic.marko) If this approach is wrong too what is your approach? 
-- Bernardo Rangel Tura, M.D,MPH,Ph.D National Institute of Cardiology Brazil __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
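As a sketch of the "cluster the covariates and pick representatives" suggestion above, assuming Hmisc is available and X is a hypothetical numeric matrix of candidate predictors:

library(Hmisc)
vc <- varclus(as.matrix(X))   # hierarchical clustering on pairwise correlations
plot(vc)                      # inspect the tree, then pick one variable per cluster
                              # using subject-matter knowledge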
Re: [R] Located Latent Class Analysis (Uebersax)
Ubersax includes an .exe file in his zip package. This should solve your problem, if you use Windows. If you use linux, you should be able to easily find a free FORTRAN compiler. At 07:24 AM 9/29/2008, Karen Chan wrote: Dear list members I am new to the list and would be much appreciated if you could help me. I am very interested in applying Latent Class Model for analysing multiple raters agreement data. I have recently found a paper by Uebersax, J. (1993) which described a method in a form of Located Latent Class Analysis (llca). Uebersax has written a Fortran program which is available on the web, for the method described in the paper. I do not have a Fortran compiler on my PC. I know there are many R packages do latent class type models. I am wondering if anyone has implemented llca, or equivalent in R. Thank you in advance for your help -- Karen Chan DHSRU University of Dundee The University of Dundee is a Scottish Registered Charity, No. SC015096. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] FW: logistic regression
that the FDA insists on clinical trials, after all (and reasons why such studies are difficult and expensive to do!). The belief that data mining (as it is known in the polite circles that Frank obviously eschews) is an effective (and even automated!) tool for discovering how Nature works is a misconception, but one that for many reasons is enthusiastically promoted. If you are looking only to predict, it may do; but you are deceived if you hope for Truth. Can you get hints? -- well maybe, maybe not. Chaos beckons. I think many -- maybe even most -- statisticians rue the day that stepwise regression was invented and certainly that it has been marketed as a tool for winnowing out the important few variables from the blizzard of irrelevant background noise. Pogo was right: We have seen the enemy -- and it is us. (As I said, the Inferno awaits...) Cheers to all, Bert Gunter DEFINITELY MY OWN OPINIONS HERE! -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of David Winsemius Sent: Saturday, September 27, 2008 5:34 PM To: Darin Brooks Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [R] FW: logistic regression It's more a statement that it expresses a statistical perspective very succinctly, somewhat like a Zen koan. Frank's book,Regression Modeling Strategies, has entire chapters on reasoned approaches to your question. His website also has quite a bit of material free for the taking. -- David Winsemius Heritage Laboratories On Sep 27, 2008, at 7:24 PM, Darin Brooks wrote: Glad you were amused. I assume that booking this as a fortune means that this was an idiotic way to model the data? MARS? Boosted Regression Trees? Any of these a better choice to extract significant predictors (from a list of about 44) for a measured dependent variable? -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] ] On Behalf Of Ted Harding Sent: Saturday, September 27, 2008 4:30 PM To: [EMAIL PROTECTED] Subject: Re: [R] FW: logistic regression On 27-Sep-08 21:45:23, Dieter Menne wrote: Frank E Harrell Jr f.harrell at vanderbilt.edu writes: Estimates from this model (and especially standard errors and P-values) will be invalid because they do not take into account the stepwise procedure above that was used to torture the data until they confessed. Frank Please book this as a fortune. Dieter Seconded! Ted. -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University No virus found in this incoming message. Checked by AVG - http://www.avg.com 1:11 PM __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R function which finds confidence interval for binomial variance
Thanks Ralph, Moshe and [EMAIL PROTECTED] for your helpful comments. Using the bootstrap (e.g., 'boot' + boot.ci()) for the confidence interval on the variance is not very accurate in coverage, because the sampling distribution is extremely skewed. In fact, the 'BCa' method returns the same result as the Efron 'percent' method. Moshe's idea of treating the confidence interval for the binomial variance as a transform of the confidence interval for the binomial proportion is elegant (Doh! Why didn't I think of that?), except that the transform is bivalued, although monotonic on each branch, with the branch point singularity at p=0.5. The bootstrap method does not have much coverage accuracy for any proportion, for n=6, 12 and 20, and the proportion method works great for n=6, 12, 20 and 50, except near p = 0.5, where it fails to achieve reasonable coverage. So I'm still looking for a reliable method for all p and for reasonable n. The proportion-based method is the best I've found, so far. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Confidence interval for binomial variance
Based on simulations, I've come up with a simple function to compute the confidence interval for the binomial variance, where the true variance is v = rho*(1-rho)/n, with rho = true probability of success and n = # of trials. For x = # successes observed in n trials, p = x / n as usual. For p < 0.25 or p > 0.75, I use the proportion-based transformed confidence interval, as suggested by Moshe Olshansky. I used the Wilson interval, but some might prefer the Blaker interval. For 0.25 <= p <= 0.75, I use the standard chi-square-based confidence interval (which is very conservative for this problem in this range). The composite method gives reliable results over a wide range of rho. This should suit the purpose until someone comes up with a more theoretically sound method. Bootstrap is not reliable for n < 50, at least.

#09.26.08 02.50 binomVarCI.r
#copyright 2008 by Robert A LaBudde, all rights reserved
#CI for binomial sample variance
#created: 09.26.08 by r.a. labudde
#changes:
require('binGroup')
binomVarCI <- function (n, x, conf=0.95) {
  p <- x/n  #proportion
  if (p < 0.25 | p > 0.75 | x == 0 | x == n) { #use proportion-based CI
    pCI <- binWilson(n, x, conf.level=conf)  #CI for proportion
    vCI <- sort(c(pCI[1]*(1-pCI[1])/(n-1), pCI[2]*(1-pCI[2])/(n-1)))
  } else { #use chi-square-based CI
    phiL <- qchisq(0.025, n-1)/(n-1)
    phiU <- qchisq(0.975, n-1)/(n-1)
    vest <- p*(1-p)/(n-1)  #variance estimate
    vCI <- c(vest/phiU, vest/phiL)  #chi-square-based
  }
  return (vCI)
}

Here is a test program to measure coverage:

#09.26.08 03.10 tbinomVarCI.r
#copyright 2008 by Robert A LaBudde, all rights reserved
#test of CI for binomial sample variance
#created: 09.26.08 by r.a. labudde
#changes:
nReal <- 1000
for (POD in c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.50)) {
  cat('\nPOD: ', sprintf('%8.4f', POD), '\n')
  for (nRepl in c(6, 12, 20, 50)) {
    vtrue <- POD*(1-POD)/nRepl
    pcover <- 0
    for (iReal in 1:nReal) {
      x <- rbinom(1, nRepl, POD)
      vCI <- binomVarCI(nRepl, x)
      #vest <- (x/nRepl)*(1 - x/nRepl)/(nRepl-1)
      if (vtrue >= vCI[1] && vtrue <= vCI[2]) pcover <- pcover + 1
    } #iReal
    pcover <- pcover/nReal
    cat('n: ', sprintf('%4i', nRepl), ' Coverage: ', sprintf('%8.4f', pcover), '\n')
  } #nRepl
} #POD

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] R function which finds confidence interval for binomial variance
I need to construct confidence intervals for the binomial variance. This is the usual estimate v = x*(n-x)/n or its unbiased counterpart v' = x*(n-x)/(n-1) where x = binomial number of successes observed in n Bernoulli trials from proportion p. The usual X^2 method for variance confidence intervals will not work, because of the strong non-normal character of the sampling distribution for v (or v'). Does anyone know of an R package with R function that computes a reasonable confidence interval for v or v'? Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Averaging 'blocks' of data
I'm not sure I exactly understand your problem, but if you are looking for a recursive algorithm for calculating the average by addition of one record only at a time, consider: y[k] = y[k-1] + (x[k] - y[k-1])/k, where y(0) = 0, k = 1, 2, ... At each stage, y[k] = (x[1]+...+x[k])/k. At 04:46 PM 9/7/2008, Steve Murray wrote: Gabor - thanks for your suggestion... I had checked the previous post, but I found (as a new user of R) this approach to be too complicated and I had problems gaining the correct output values. If there is a simpler way of doing this, then please feel free to let me know. Dylan - thanks, your approach is a good start. In answer to your questions, my data are 43200 columns and 16800 rows as a data frame - I will probably have to read the dataset in segments though, as it won't fit into the memory! I've been able to follow your example - how would I be able to apply this technique for finding the average of each 60 x 60 block? Any other suggestions are of course welcome! Many thanks again, Steve __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
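A minimal sketch of the running-mean update described above (x is a hypothetical numeric vector; at step k, y equals mean(x[1:k])):

running_mean <- function(x) {
  y <- 0
  for (k in seq_along(x)) {
    y <- y + (x[k] - y) / k   # incremental update: y is now mean(x[1:k])
  }
  y
}
running_mean(c(2, 4, 6))   # 4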
Re: [R] Projecting Survival Curve into the Future
You might consider a probit analysis using ln(Time) as the dose. At 09:24 AM 9/4/2008, [EMAIL PROTECTED] wrote: I have a survivor curve that shows account cancellations during the past 3.5 months. Â Fortunately for our business, but unfortunately for my analysis, the survivor curve doesn't yet pass through 50%. Â Is there a safe way to extend the survivor curve and estimate at what time we'll hit 50%? Without any example code it's hard to say, but take a look at ?predict.coxph and ?predict.survreg in the survival package. Regards, Richie. Mathematical Sciences Unit HSL Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
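One reading of that suggestion, as a hedged sketch only (variable names are hypothetical; 'cancelled' is the cumulative number of accounts cancelled by each time out of 'atrisk' at the start): fit a probit regression on log time and solve for the time at which the fitted curve crosses 50%.

fit <- glm(cbind(cancelled, atrisk - cancelled) ~ log(time),
           family = binomial(link = "probit"), data = dat)
# probit(0.5) = 0, so the 50% point is where intercept + slope*log(t) = 0
t50 <- exp(-coef(fit)[1] / coef(fit)[2])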
Re: [R] Upgrading R means I lose my packages
At 04:12 AM 8/28/2008, ONKELINX, Thierry wrote: On a Windows machine you get the same problem. Unless one uses the same trick as Rolf suggested: don't install the packages in the default directory and set R_LIBS to that directory. Then all you need to do after an upgrade is to set R_LIBS in the new version and run update.packages(checkBuilt = TRUE). Given Rolf's suggestion I suppose this trick will work on a Mac too. What I do in installing a new version of R on a Windows system is as follows:
1. In the c:\Program Files\R folder, the installation is in a subfolder labeled by the version, such as R-2.7.1.
2. I leave the old (say, 2.7.1) version installed, and install the new version (say, 2.7.2). This leaves the old subfolder R-2.7.1 intact, and creates a new one R-2.7.2.
3. I use a file-compare utility (in my case, Beyond Compare, which I recommend), to compare the subfolders C:\Program Files\R\R-2.7.1\library and C:\Program Files\R\R-2.7.2\library. I set the comparison to find files present or newer in the 2.7.1 folder vs. the 2.7.2. Then I copy all such files over.
4. At this point, the 2.7.2 library has the same or newer packages than 2.7.1, most or all of which will work.
5. I use the Packages | Update packages... menu item to update packages to 2.7.2.
6. Then I delete the 2.7.1 subfolder. You need Administrator rights to do this.
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
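A sketch of the R_LIBS approach described above (the library path is only an example):

## In the file .Renviron (or Renviron.site), set a personal library path, e.g.
##   R_LIBS=C:/R/site-library
## Then, in the newly installed R version, rebuild/refresh packages built under
## the old version:
update.packages(checkBuilt = TRUE, ask = FALSE)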
[R] Specifying random effects distribution in glmer()
I'm trying to figure out how to carry out a Poisson regression fit to longitudinal data with a gamma distribution with unknown shape and scale parameters. I've tried the 'lme4' package's glmer() function, which fits the Poisson regression using:

library('lme4')
fit5 <- glmer(seizures ~ time + progabide + timeXprog + offset(lnPeriod) + (1|id),
  data=pdata, nAGQ=1, family=poisson)  #note: can't use nAGQ > 1, not yet implemented
summary(fit5)

Here 'seizures' is a count and 'id' is the subject number. This fit works, but uses the Poisson distribution with the gamma heterogeneity. Based on the example in the help for glmer(), I tried

fit6 <- glmer(seizures ~ time + progabide + timeXprog + offset(lnPeriod) +
  (1|pgamma(id, shap, scal)), data=pdata, nAGQ=1, start=c(shap=1, scal=1),
  family=poisson)  #note: can't use nAGQ > 1, not yet implemented
summary(fit6)

but this ends up with Error in pgamma(id, shap, scal) : object 'shap' not found. My questions are: 1. Can this be done? 2. Am I using the right package and function? 3. What am I doing wrong? Any help would be appreciated. Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Null and Alternate hypothesis for Significance test
for Significance test To: r-help@r-project.org Received: Friday, 22 August, 2008, 6:58 AM Hi, I had a question about specifying the Null hypothesis in a significance test. Advance apologies if this has already been asked previously or is a naive question. I have two samples A and B, and I want to test whether A and B come from the same distribution. The default Null hypothesis would be H0: A=B But since I am trying to prove that A and B indeed come from the same distribution, I think this is not the right choice for the null hypothesis (it should be one that is set up to be rejected) How do I specify a null hypothesis H0: A not equal to B for say a KS test. An example to do this in R would be greatly appreciated. On a related note: what is a good way to measure the difference between observed and expected PDFs? Is the D statistic of the KS test a good choice? Thanks! Nitin [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] shapiro wilk normality test
At 11:30 AM 7/12/2008, Bunny, lautloscrew.com wrote: Hi everybody, somehow i dont get the shapiro wilk test for normality. i just can´t find what the H0 is . i tried : shapiro.test(rnorm(5000)) Shapiro-Wilk normality test data: rnorm(5000) W = 0.9997, p-value = 0.6205 If normality is the H0, the test says it´s probably not normal, doesn ´t it ? 5000 is the biggest n allowed by the test... are there any other test ? ( i know qqnorm already ;) thanks in advance matthias Yes, H0 is normality. The P-value, as for other statistical tests, measures the probability that this sample could have arisen from the population under H0. 0.62 is a probability very compatible with H0. The typical rejection criterion would be a P-value 0.05, which is not the case here. The limitation to n = 5000 is not serious, as even a few hundred data should take you to the asymptotic region. Use sample() to select the data at random from within your data set to avoid bias in using the test. E.g., shapiro.test(sample(mydata, 1000, replace=TRUE)) Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] shapiro wilk normality test
At 12:48 PM 7/12/2008, Bunny, lautloscrew.com wrote: first of all thanks yall. it´s always good to get it from people that know for sure. my bad, i meant to say it´s compatible with normality. i just wanted to know if it wouldnt be better to test for non-normality in order to know for sure. and if so, how can i do it? Doing a significance test may seem complicated, but it's an almost trivial concept. You assume some null hypothesis that specifies a unique distribution that you can use to calculate probabilities from. Then use this distribution to calculate the probability of finding what you found in your data, or more extreme. This is the P-value of the test. It is the probability of finding what you found, given that the null hypothesis is true. You give up (reject) the null hypothesis if this P-value is too unbelievably small. The conventional measure for ordinary, repeatable experiments is 0.05. Sometimes a smaller value like 0.01 is more reasonable. Doing what has been suggested, i.e., using a null hypothesis of nonnormality, is unworkable. There are uncountably infinite ways to specify a nonnormal distribution. Is it discrete or continuous? Is it skewed or symmetric? Does it go from zero to infinity, from 0 to 1, from -infinity to infinity, or anything else? Does it have one mode or many? Is it continuous or differentiable? Etc. In order to do a statistical test, you must be able to calculate the P-value. That usually means your null hypothesis must specify a single, unique probability distribution. So nonnormal in testing means reject normal as the distribution. Nonnormal is not defined other than it's not the normal distribution. If you wish to test how the distribution is nonnormal, within some family of nonnormal distributions, you will have to specify such a null hypothesis and test for deviation from it. E.g., testing for coefficient of skewness = 0. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Is there a good package for multiple imputation of missing values in R?
I'm looking for a package that has a state-of-the-art method of imputation of missing values in a data frame with both continuous and factor columns. I've found transcan() in 'Hmisc', which appears to be possibly suited to my needs, but I haven't been able to figure out how to get a new data frame with the imputed values replaced (I don't have Harrell's book). Any pointers would be appreciated. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Is there a good package for multiple imputation of missing values in R?
At 03:02 AM 6/30/2008, Robert A. LaBudde wrote: I'm looking for a package that has a state-of-the-art method of imputation of missing values in a data frame with both continuous and factor columns. I've found transcan() in 'Hmisc', which appears to be possibly suited to my needs, but I haven't been able to figure out how to get a new data frame with the imputed values replaced (I don't have Harrell's book). Any pointers would be appreciated. Thanks to paulandpen, Frank and Shige for suggestions. I looked at the packages 'Hmisc', 'mice', 'Amelia' and 'norm'. I still haven't mastered the methodology for using aregImpute() in 'Hmisc' based on the help information. I think I'll have to get hold of Frank's book to see how it's used in a complete example. 'Amelia' and 'norm' appear to be focused solely on continuous, multivariate normal variables, but my needs typically involve datasets with both factors and continuous variables. The function mice() in 'mice' appears to best suit my needs, and the help file was intelligible, and it works on both factors and continuous variables. For those in the audience with similar issues, here is a code snippet showing how some of these functions work ('felon' is a data frame with categorical and continuous predictors of the binary variable 'hired'):

library('mice')  #missing data imputation library for md.pattern(), mice(), complete()
names(felon)  #show variable names
md.pattern(felon[,1:4])  #show patterns for missing data in 1st 4 vars
library('Hmisc')  #package for na.pattern() and impute()
na.pattern(felon[,1:4])  #show patterns for missing data in 1st 4 vars
#simple imputation can be done by
felon2 <- felon  #make copy
felon2$felony <- impute(felon2$felony)  #impute NAs (most frequent)
felon2$gender <- impute(felon2$gender)  #impute NAs
felon2$natamer <- impute(felon2$natamer)  #impute NAs
na.pattern(felon2[,1:4])  #show no NAs left in these vars
fit2 <- glm(hired ~ felony + gender + natamer, data=felon2, family=binomial)
summary(fit2)
#better, multiple imputation can be done via mice():
imp <- mice(felon[,1:4])  #do multiple imputation (default is 5 realizations)
for (iSet in 1:5) {  #show results for the 5 imputation datasets
  fit <- glm(hired ~ felony + gender + natamer, data=complete(imp, iSet),
    family=binomial)  #fit to iSet-th realization
  print(summary(fit))
}

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
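As a hedged follow-up to the loop above: the 'mice' package can also fit the model to each imputed dataset and pool the results (Rubin's rules) rather than printing the five fits separately, along the lines of:

fits <- with(imp, glm(hired ~ felony + gender + natamer, family = binomial))
summary(pool(fits))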
Re: [R] How to make R running faster
At 10:25 AM 5/28/2008, Esmail Bonakdarian wrote: Erin Hodgess wrote: I remember reading that colSums() and colMeans() were better, when you need sums and means Well .. I'm waiting for the experts to jump in and give us the straight story on this :-) All of the algorithms are represented internally by sequential program logic using C or Fortran, for example. So the issue isn't the algorithm itself. Instead, it's where the algorithm is implemented. However, R is an interpreter, not a compiler. This means that it reads each line of R code one character at a time to develop an understanding of what is to be done, and to check for errors in syntax and data classes. Interpreters are very slow compared to compiled code, where the lines have been pre-interpreted and already converted to machine code with error checking resolved. For example a simple for loop iteration might take only 0.1 microsecond in a compiled program, but 20-100 microseconds in an interpreted program. This overhead of parsing each line can be amortized over the work done by function calls inside each line. If the compiled function executes on a large number of cases in one call, then the 50 microsecond overhead per call is diluted out. R is a vectorized, array-processing language. If you use vectors and arrays and the built-in (i.e., compiled) function calls, you get maximum use of the compiled programs and minimum use of the interpreted program. This is why functions such as colMeans() or apply() are faster than writing direct loops in R. You can speed things up by 200-1000x for large arrays. Interpreted languages are very convenient to use, as they do instant error checking and are very interactive. No overhead of learning and using compilers and linkers. But they are very slow on complex calculations. This is why the array processing is stuffed into compiled functions. The best of both worlds then. Interpreted languages are Java, R, MATLAB, Gauss and others. Compiled languages are C and Fortran. Some, like variants of BASIC, can be interpreted, line-compiled or compiled, depending upon implementation. Some compiled languages (such as Fortran) can allow parallel processing via multiprocessing on multiple CPUs, which speeds things up even more. Compiled languages also typically optimize code for the target machine, which can speed things up by a factor of 2 or so. So the general rule for R is: If you are annoyed at processing time, alter your program to maximize calculations within compiled functions (i.e., vectorize your program to process an entire array at one time) and minimize the number of lines of R. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
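A small illustration of the point above: the same column means computed with an interpreted double loop and with the compiled colMeans() (timings will of course vary by machine; this is only a sketch).

m <- matrix(rnorm(1e6), nrow = 1000)
system.time({            # interpreted loop, one R statement per element
  cm <- numeric(ncol(m))
  for (j in seq_len(ncol(m))) {
    s <- 0
    for (i in seq_len(nrow(m))) s <- s + m[i, j]
    cm[j] <- s / nrow(m)
  }
})
system.time(cm2 <- colMeans(m))   # one call into compiled code
all.equal(cm, cm2)                # same answer, very different timings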
Re: [R] Solving 100th order equation
At 07:58 AM 5/24/2008, Duncan Murdoch wrote: Shubha Vishwanath Karanth wrote: To apply uniroot I don't even know the interval values... Do numerical methods help me? Or any other method? I forgot: we also have polyroot(). Duncan Murdoch Thanks and Regards, Shubha -Original Message- From: Duncan Murdoch [mailto:[EMAIL PROTECTED] Sent: Saturday, May 24, 2008 5:08 PM To: Shubha Vishwanath Karanth Subject: Re: [R] Solving 100th order equation Shubha Vishwanath Karanth wrote: Hi R, I have a 100th order equation for which I need to solve the value for x. Is there a package to do this? snip Finding all of the roots of a 100-th order polynomial on a computer is not an easy task. It would take more than 100 digits of precision to do so. It is a problem similar to finding all 100 eigenvalues of a 100x100 ill-conditioned matrix. The suggestion to use graphics to find small intervals localizing the roots of interest first would make the most sense, at least for the real roots. It is highly unlikely that all 100 roots are needed. Usually the ones of interest are the ones with the largest real part, or the ones inside the unit circle, or the real roots, etc. Is this a homework problem? Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
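For reference, a minimal sketch of polyroot() on a low-order example (the coefficients are hypothetical and given in increasing order, here 1 + 2x + 3x^2):

polyroot(c(1, 2, 3))
# For a 100th-order polynomial the coefficient vector would have length 101;
# the accuracy of the returned roots degrades badly for ill-conditioned
# high-order polynomials, as noted above.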
Re: [R] Simulation study in R
At 02:40 AM 4/29/2008, Arun Kumar Saha wrote: Here I am in a simulation study where I want to find different values of x and y such that f(x,y)=c (some known constant) w.r.t. x, y 0, y=x and x=c1 (another known constant). Can anyone please tell me how to do it efficiently in R. One way I thought that I will draw different random numbers from uniform dist according to that constraints and pick those which satisfy f(x,y)=c. However it is not I think computationally efficient. Can anyone here suggest me any other efficient approach? You have not specified the distributions proper for X and Y. Using a uniform distribution is only appropriate when it meets requirements. One obvious approach is to sample one of the variables, say X, and then solve your equation for Y. If you're going to draw a lot of samples, it would pay to develop y = g(x) first. But you need to know how to sample X in the first place. Is its distribution uniform, or something else? Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
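A sketch of the "sample one variable, then solve for the other" approach above, using a hypothetical f(x, y) = x^2 + y^2 with c = 1, where y = g(x) can be written down directly; the uniform distribution assumed for x here is only for illustration and must be replaced by whatever distribution the problem actually requires:

f <- function(x, y) x^2 + y^2          # hypothetical constraint function
cc <- 1                                # the known constant c
x <- runif(1000, 0, sqrt(cc))          # sample x (distribution is an assumption)
y <- sqrt(cc - x^2)                    # solve f(x, y) = c for y, i.e. y = g(x)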
Re: [R] quasi-random sequences
You seem to have ambiguous requirements. First, you want equidistribution for a packing structure, which would suggest closest packing or quasirandom sequences, as you have tried. But then you are disturbed by the packing structure, because it gives a perceivable pattern, so you wish to randomize it, but under some unspecified condition of equidistribution (your electrostatic repulsion algorithm). You obviously have some constraints on selection you have not quantified yet. E.g., circles of unspecified radii cannot overlap. You should also realize that any distribution of centers under such constraints will always exhibit structure due to the constraints. Is your problem simply to give an appearance of randomness to the casual observer, or something more definite? You also need to say something about the packing density involved. On the face of it, you with At 06:22 AM 4/26/2008, baptiste Auguié wrote: Dear list useRs, I have to generate a random set of coordinates (x,y) in [-1 ; 1]^2 for say, N points. At each of these points is drawn a circle (later on, an ellipse) of random size, as in:

N <- 100
positions <- matrix(rnorm(2 * N, mean = 0, sd = 0.5), nrow=N)
sizes <- rnorm(N, mean = 0, sd = 1)
plot(positions, type='p', cex=sizes)

My problem is to avoid collisions (overlap, really) between the points. I would like some random pattern, but with a minimum exclusion distance. In looking up Numerical Recipes in C, I found out about some Sobol quasi-random sequences, which one can call from the gsl package,

library(gsl)
g <- qrng_alloc(type='sobol', dim=2)
qrng_get(g, n=N) -> xy
plot(xy, t='p', cex=0.5)

but this does not look very random: I clearly see some pattern (diagonals, etc...), and even the non-overlapping condition is not impressive. One (painful) way I can foresee is to check the distance between each symbol and the others, and move the overlapping ones in a recursive manner. Before delving into this, I wanted to check I'm not overlooking something in the gsl quasi-random sequences, or missing a more obvious way to generate such patterns. Perhaps solving an electrostatic problem with a potential both attractive at long distances and repulsive at short distances is a better way? I have a vague recollection of hearing that somewhere to position points evenly on a sphere. Thanks for any comment / suggestion, Baptiste _ Baptiste Auguié Physics Department University of Exeter Stocker Road, Exeter, Devon, EX4 4QL, UK Phone: +44 1392 264187 http://newton.ex.ac.uk/research/emag http://projects.ex.ac.uk/atto __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] bootstrap for confidence intervals of the mean
See the help for boot(). The function in the 2nd argument has to be of a special form. You need to define such a form, as in:

fmean <- function(x, i) mean(x[i])   # use data x[] and indices i

and then

boot.out <- boot(d, fmean, R = 1000, sim = 'permutation')

At 12:59 PM 4/22/2008, stephen sefick wrote:

d <- c(0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, 0L, 7375L, NA, NA, 17092L, 0L, 0L, 32390L, 2326L, 22672L, 13550L, 18285L)
boot.out <- boot(d, mean, R = 1000, sim = "permutation")
Error in mean.default(data, original, ...) : 'trim' must be numeric of length one

I know that I am missing something but I can't figure it out. thanks stephen -- Let's not spend our time and resources thinking about things that are so little or so large that all they really do for us is puff us up and make us feel like gods. We are mammals, and have not exhausted the annoying little problems of being mammals. -K. Mullis [[alternative HTML version deleted]] __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
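A runnable sketch of the pattern described above, using the data from the question with the default (ordinary) resampling. Dropping the NAs beforehand is one choice of how to treat the missing values (passing na.rm = TRUE inside the statistic is another); which is appropriate depends on why the values are missing:

library(boot)
d <- c(0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 0L, 0L, 7375L, NA, NA,
       17092L, 0L, 0L, 32390L, 2326L, 22672L, 13550L, 18285L)
d <- d[!is.na(d)]                          # one way to handle the missing values
fmean <- function(x, i) mean(x[i])         # statistic: data x[], indices i
b <- boot(d, fmean, R = 1000)              # ordinary nonparametric bootstrap
boot.ci(b, type = c("norm", "basic", "perc", "bca"))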
[R] Multidimensional contingency tables
How does one ideally handle and display multidimensional contingency tables in R v. 2.6.2? E.g.:

prob1 <- data.frame(victim = c(rep('white',4), rep('black',4)),
                    perp = c(rep('white',2), rep('black',2), rep('white',2), rep('black',2)),
                    death = rep(c('yes','no'), 4),
                    count = c(19, 132, 11, 52, 0, 9, 6, 97))
prob1
  victim  perp death count
1  white white   yes    19
2  white white    no   132
3  white black   yes    11
4  white black    no    52
5  black white   yes     0
6  black white    no     9
7  black black   yes     6
8  black black    no    97

The xtabs() function doesn't seem appropriate, as it has no means of using 'count'. This must be a common problem. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Multidimensional contingency tables
Now that is simple and elegant. Thanks! PS. Is there a course available for learning how to read R help information? :)

At 10:52 PM 4/21/2008, Gabor Grothendieck wrote: xtabs(count ~ ., prob1)

On Mon, Apr 21, 2008 at 10:46 PM, Robert A. LaBudde [EMAIL PROTECTED] wrote: How does one ideally handle and display multidimensional contingency tables in R v. 2.6.2? E.g.:

prob1 <- data.frame(victim = c(rep('white',4), rep('black',4)),
                    perp = c(rep('white',2), rep('black',2), rep('white',2), rep('black',2)),
                    death = rep(c('yes','no'), 4),
                    count = c(19, 132, 11, 52, 0, 9, 6, 97))
prob1
  victim  perp death count
1  white white   yes    19
2  white white    no   132
3  white black   yes    11
4  white black    no    52
5  black white   yes     0
6  black white    no     9
7  black black   yes     6
8  black black    no    97

The xtabs() function doesn't seem appropriate, as it has no means of using 'count'. This must be a common problem. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
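To illustrate, assuming prob1 as defined in the question, the one-liner above builds the full 2 x 2 x 2 table; ftable() is one convenient way to display it, and summary() gives a quick test of mutual independence:

tab <- xtabs(count ~ victim + perp + death, data = prob1)   # same as count ~ .
ftable(tab, row.vars = c("victim", "perp"))                 # flattened display
summary(tab)                                                # chi-squared test of mutual independence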
[R] Problem installing and using package tseries
I have R 2.6.2, and have tried downloading and installing the package tseries. I get the same error when I use two different mirror sites: utils:::menuInstallPkgs() trying URL 'http://cran.mirrors.hoobly.com/bin/windows/contrib/2.6/tseries_0.10-14.zip' Content type 'application/zip' length 400799 bytes (391 Kb) opened URL downloaded 391 Kb package 'tseries' successfully unpacked and MD5 sums checked The downloaded packages are in C:\Documents and Settings\User\Local Settings\Temp\Rtmpk2jrvn\downloaded_packages updating HTML package descriptions library('tseries') Error in dyn.load(file, ...) : unable to load shared library 'C:/PROGRA~1/R/R-26~1.2/library/tseries/libs/tseries.dll': LoadLibrary failure: The specified module could not be found. Error: package/namespace load failed for 'tseries' There doesn't seem to be anything unusual. The package is installed like any other, and the required file is at C:\Program Files\R\R-2.6.2\library\tseries\libs\tseries.dll. It is 36,864 bytes and looks superficially okay. Any ideas as to what's wrong here? Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Problem installing and using package tseries
Thanks, Uwe. Note that another posting today had the same type of problem: require(Matrix) Loading required package: Matrix Error in dyn.load(file, ...) : unable to load shared library 'C:/PROGRA~1/R/R-26~1.2/library/Matrix/libs/Matrix.dll': LoadLibrary failure: The specified module could not be found. In both cases, R is looking to the wrong folder apparently. At 04:54 PM 4/10/2008, Uwe Ligges wrote: Can you please try to reinstall in roughly 12 hours from now from CRAN master? I have fixed the problem and recompiled tseries which will appear on CRAN master within 12 hours. Best wishes, Uwe Ligges Robert A. LaBudde wrote: I have R 2.6.2, and have tried downloading and installing the package tseries. I get the same error when I use two different mirror sites: utils:::menuInstallPkgs() trying URL 'http://cran.mirrors.hoobly.com/bin/windows/contrib/2.6/tseries_0.10-14.zip' Content type 'application/zip' length 400799 bytes (391 Kb) opened URL downloaded 391 Kb package 'tseries' successfully unpacked and MD5 sums checked The downloaded packages are in C:\Documents and Settings\User\Local Settings\Temp\Rtmpk2jrvn\downloaded_packages updating HTML package descriptions library('tseries') Error in dyn.load(file, ...) : unable to load shared library 'C:/PROGRA~1/R/R-26~1.2/library/tseries/libs/tseries.dll': LoadLibrary failure: The specified module could not be found. Error: package/namespace load failed for 'tseries' There doesn't seem to be anything unusual. The package is installed like any other, and the required file is at C:\Program Files\R\R-2.6.2\library\tseries\libs\tseries.dll. It is 36,864 bytes and looks superficially okay. Any ideas as to what's wrong here? Thanks. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] generate random numbers subject to constraints
At 05:06 PM 3/26/2008, Ted Harding wrote: On 26-Mar-08 21:26:59, Ala' Jaouni wrote: X1,X2,X3,X4 should have independent distributions. They should be between 0 and 1 and all add up to 1. Is this still possible with Robert's method? Thanks

I don't think so. A while ago you wrote "The numbers should be uniformly distributed" (but in the context of an example where you had 5 variables; now you are back to 4 variables). Let's take the 4-variable case first. The two linear constraints confine the point (X1,X2,X3,X4) to a triangular region within the 4-dimensional unit cube. Say it has vertices A, B, C. You could then start by generating points uniformly distributed over a specific triangle in 2 dimensions, say the one with vertices at A0=(0,0), B0=(0,1), C0=(1,0). This is easy. Then you need to find a linear transformation which will map this triangle (A0,B0,C0) onto the triangle (A,B,C). Then the points you have sampled in (A0,B0,C0) will map into points which are uniformly distributed over the triangle (A,B,C). More generally, you will be seeking to generate points uniformly distributed over a simplex. For example, the case (your earlier post) of 5 variables with 2 linear constraints requires a tetrahedron with vertices (A,B,C,D) in 5 dimensions whose coordinates you will have to find. Then take an easy tetrahedron with vertices (A0,B0,C0,D0) and sample uniformly within this. Then find a linear mapping from (A0,B0,C0,D0) to (A,B,C,D) and apply this to the sampled points. This raises a general question: Does anyone know of an R function to sample uniformly in the interior of a general (k-r)-dimensional simplex embedded in k dimensions, with (k-r+1) given vertices? snip

The method of rejection:
1. Generate numbers randomly in the hypercube.
2. Test to see if the point falls within the prescribed area.
3. Accept the point if it does.
4. Repeat if it doesn't.
Efficiency depends upon the ratio of volumes involved.

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
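A minimal sketch of the rejection method just described, for constraints that carve out a region of positive volume (inequalities; an exact equality such as sum(x) = 1 has zero volume, and is better handled by solving for one coordinate or by the simplex-mapping approach above). The constraint region used here is hypothetical:

inside <- function(x) sum(x) <= 1 && all(x >= 0)   # hypothetical region in the hypercube
sample_rejection <- function(n, k, inside) {
  out <- matrix(NA_real_, nrow = n, ncol = k)
  got <- 0
  while (got < n) {
    x <- runif(k)          # 1. generate a point in the unit hypercube
    if (inside(x)) {       # 2. test the constraints
      got <- got + 1       # 3. accept the point if it satisfies them
      out[got, ] <- x
    }                      # 4. otherwise repeat
  }
  out
}
pts <- sample_rejection(1000, 4, inside)

As noted, the expected number of draws per accepted point is the reciprocal of the volume ratio, so this is only practical when the constrained region is not too small.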
Re: [R] peak finding
At 10:48 PM 3/24/2008, Chistos wrote: John, There is a peak finding algorithm attributed to Prof. Ripley in the R-help archive. It has a span parameter which might give you something close to what you seem to be looking for. http://finzi.psych.upenn.edu/R/Rhelp02a/archive/33097.html -Christos

Finding peaks and valleys has several issues:
1. Impact of noise.
2. Mathematical smoothness of the underlying signal.
3. Boundary conditions at the beginning and end of the data series.
4. Bias from smoothing.
If the noise is not too bad, I'd try fitting a smoothing spline, and then use the zero-derivative points to locate the extrema. If the noise is severe, you will probably need some domain knowledge, and may end up with locally weighted regression, followed by an extrema search. For cases where the noise is trivial, the following short function might be useful in picking off peaks (and, symmetrically modified, valleys):

# findpeaks()
# Copyright 2007 by Robert A LaBudde, all rights reserved
# find peaks in (x,y) curve
findpeaks <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  if (nx != ny) stop('findpeaks01: x and y must be same length!')
  ipeak <- NULL
  xpeak <- NULL
  ypeak <- NULL
  yprv <- y[1]
  for (i in 1:ny) {
    ynext <- ifelse(i == ny, y[ny], y[i+1])
    if (yprv < y[i] && y[i] > ynext) {   # found local peak
      ipeak <- c(ipeak, i)
      xpeak <- c(xpeak, x[i])
      ypeak <- c(ypeak, y[i])
    }
    yprv <- y[i]
  }
  return(data.frame(ipeak, x = xpeak, y = ypeak))
}

Trivial though it may be, it actually works quite well for its intended purpose. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
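A hypothetical usage example combining the smoothing-spline suggestion with findpeaks(); the signal, noise level, and grid are made up:

set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- sin(x) + rnorm(200, sd = 0.1)   # noisy signal with peaks near x = pi/2 and 5*pi/2
fit <- smooth.spline(x, y)           # smooth first to suppress spurious local maxima
ys <- predict(fit, x)$y              # smoothed values on the original grid
findpeaks(x, ys)                     # local maxima of the smoothed curve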
Re: [R] stepAIC and polynomial terms
At 08:50 PM 3/16/2008, caspar wrote: Dear all, I have a question regarding the use of stepAIC and polynomial (quadratic, to be specific) terms in a binary logistic regression model. I read in McCullagh and Nelder (1989, p. 89), and as far as I remember from my statistics courses, that higher-degree polynomial effects should not be included without the main effects. If I understand this correctly, following a stepwise model selection based on AIC should not lead to a model where the main effect of some continuous covariate is removed but the quadratic term is kept. The question is, should I keep the quadratic term (note, there are other main effects that were retained following the stepwise algorithm) in the final model, or should I delete it as well and move on? Or should I retain the main effect as well? To picture it, the initial model to which I called stepAIC is:

Call: glm(formula = S ~ FR + Date * age + I(age^2), family = logexposure(ExposureDays = DATA$int), data = DATA)

and the final one:

Call: glm(formula = S ~ FR + Date + I(age^2), family = logexposure(ExposureDays = DATA$int), data = DATA)

Thanks very much in advance for your thoughts and suggestions, Caspar

1. You should only exclude age as a linear term if you have a sound causal reason for believing a pure quadratic component is solely present. Based on your example, you probably don't have this.
2. You also need to worry about interactions.
3. An alternative to your polynomial approach to a causal variable such as age is to categorize age into 5- or 10-year intervals, and see how the fit breaks down by these levels.
4. You should plot your data vs. age to see what the dependence is. Frequently the curve is flat up to a certain age and then linear thereafter. This gives rise to a pseudo-quadratic relationship. You should be able to fit it better with the split plus a linear term.
5. Think about how age should affect your response before trying models.

Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
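A rough sketch of points 3 and 4 above, reusing the poster's variable names (S, FR, Date, age, DATA). The break points and the knot a0 are hypothetical, and family = binomial stands in here for the poster's custom logexposure() family, purely for illustration:

# Point 3: categorize age into 10-year intervals and inspect the fit by level
DATA$ageGrp <- cut(DATA$age, breaks = seq(0, 100, by = 10), right = FALSE)
fit.cat <- glm(S ~ FR + Date + ageGrp, family = binomial, data = DATA)

# Point 4: "flat up to a certain age, then linear" via a hinge term at a hypothetical split a0
a0 <- 40
fit.hinge <- glm(S ~ FR + Date + pmax(age - a0, 0), family = binomial, data = DATA)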
Re: [R] How to do multi-factor stratified sampling in R
At 03:54 PM 3/8/2008, David Winsemius wrote: Robert A. LaBudde [EMAIL PROTECTED] wrote in news:[EMAIL PROTECTED]: Given a set of data with a number of variables plus a response, I'd like to obtain a randomized subset of the rows such that the marginal proportions of each variable are maintained closely in the subset to that of the dataset, and possibly maintaining as well the two-factor interaction marginal proportions as well for some pairs. This must be a common problem in data mining, but I don't seem to be able to locate the proper library or function for doing this in R. Thanks for any help. Have you looked at the sampling package? I have never used it, but the strata() function appears to be capable. Thank you for pointing out this package and function. It is going to be very useful. I'll have to look into how I'm searching R, as 'sampling' and 'stratified' should have turned this up. I will spend some time looking at how strata() works to see how well it handles the problems I'm looking at. Thanks again. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
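A minimal sketch (not tested against the exact use case above) of proportional stratified sampling with sampling::strata(). The data frame dat and its columns are hypothetical, and note that strata() expects the data sorted by the stratification variables, with the stratum sizes given in that same order:

library(sampling)
dat <- data.frame(sex    = sample(c("F", "M"), 200, replace = TRUE),
                  region = sample(c("N", "S"), 200, replace = TRUE),
                  y      = rnorm(200))
dat <- dat[order(dat$sex, dat$region), ]            # sort by the strata variables
n.per <- round(0.25 * table(dat$sex, dat$region))   # take ~25% of each stratum
st <- strata(dat, stratanames = c("sex", "region"),
             size = as.vector(t(n.per)), method = "srswor")
samp <- getdata(dat, st)                            # the stratified subsample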
[R] Getting 'tilting' confidence intervals in R
[Resend with proper subject line] I am trying to compute bootstrap confidence intervals for a sample using R 2.5.1 for Windows. I can get Normal, Basic, Percentile, BCa and ABC intervals from boot.ci() and boot() in the Davison and Hinkley boot package. But I can't figure out how to use tilt.boot() to get the tilting confidence interval. Here's a program snippet:

g = rgamma(N, shape=2, scale=3) # generate a sample of N random deviates
varf = function (x, i) { var(x[i]) }
b = boot(g, varf, R=1000)
boot.ci(b)
fabc = function (x, w) { sum(x^2*w)/sum(w) - (sum(x*w)/sum(w))^2 } # weighted average (biased) variance
abc.ci(g, fabc) # ABC method confidence interval
bt = tilt.boot(g, varf, R=c(1000,1000,1000))

The bt object doesn't have a confidence interval method or property, even though alpha=c(.025, .975) is a default parameter in the tilt.boot() call. This must be simple, but I'm not finding the answer in the documentation. Use of any other packages to accomplish the tilting interval would also be useful. Thanks. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] applying math/stat functions to rows in data frame
At 12:02 PM 9/15/2007, Gerald wrote: Hi All, There are a variety of functions that can be applied to a variable (column) in a data frame: mean, min, max, sd, range, IQR, etc. I am aware of only two that work on the rows, using q1-q3 as example variables: rowMeans(cbind(q1,q2,q3),na.rm=T) #mean of multiple variables rowSums (cbind(q1,q2,q3),na.rm=T) #sum of multiple variables Can the standard column functions (listed in the first sentence) be applied to rows, with the use of correct indexes to reference the columns of interest? Or, must these summary functions be programmed separately to work on a row? Try using t() to transpose the matrix, and then apply the column function of interest. Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED] Least Cost Formulations, Ltd.URL: http://lcfltd.com/ 824 Timberlake Drive Tel: 757-467-0954 Virginia Beach, VA 23464-3239Fax: 757-467-2947 Vere scire est per causas scire __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
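For illustration, with hypothetical columns q1-q3, either apply() over rows or the transpose trick mentioned above lets any column-style summary work row-wise:

dat <- data.frame(q1 = rnorm(5), q2 = rnorm(5), q3 = rnorm(5))
m <- as.matrix(dat[, c("q1", "q2", "q3")])
rowMeans(m, na.rm = TRUE)         # built-in row-wise mean
apply(m, 1, sd,  na.rm = TRUE)    # row-wise SD via apply over MARGIN = 1
apply(m, 1, IQR, na.rm = TRUE)    # likewise for IQR, min, max, etc.
colMeans(t(m), na.rm = TRUE)      # the t() trick: row means as column means of the transpose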