[R] Extract complete rows by group and maximum
Hi

I'm trying to extract complete rows from a data frame, by group, based on the maximum in a column within that group. Thus I have a data frame like:

cvd_basestudy ... es_time ...
study1        ... 0.3091667
study2        ... 0.3091667
study2        ... 0.2625000
study3        ... 0.3030000
study3        ... 0.2625000
etc.

I can extract the basestudy and the max(es_time) using ddply:

ddply(datares_sinus_variable, .(cvd_basestudy), function(x) { max(x[['es_time']]) })

or with by():

by(datares_sinus_variable$es_time, datares_sinus_variable$cvd_basestudy, max)

but how do I extract the whole line, so that I can get a data frame with all the data for the maximum line? (dput output from the first 5 rows of my actual data frame follows.)

Any help would be much appreciated. Thanks in advance.

Sandy Small

structure(list(cvd_basestudy = c("study1", "study2", "study2", "study3",
"study3"), ecd_rhythm = structure(c(5L, 5L, 5L, 5L, 5L), .Label = c("AF",
"FLUTTER", "PACED AF", "SCRAP", "SINUS", "UNSURE"), class = "factor"),
cvd_frame_mode = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("fixed_time",
"variable_time"), class = "factor"), cvd_part_fmt = structure(c(4L, 4L, 4L,
4L, 4L), .Label = c("first", "last", "mid", "whole"), class = "factor"),
cvd_prev_fmt = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("All", "Best",
"Q1", "Q2", "Q3", "Q4"), class = "factor"), cvd_cur_fmt = structure(c(5L,
5L, 1L, 4L, 4L), .Label = c("All", "Best", "Q1", "Q2", "Q3", "Q4"),
class = "factor"), ps_pt = c(1, 1, 2, 1, 2), es_pt = c(8, 8, 8, 8, 8),
ed_pt = c(21, 21, 18, 17, 18), cvd_median_limit = c(1.057, 1.057, 1.048,
1.037, 1.05), cvd_average_beat = c(1.06, 1.06, 1.05, 1.04, 1.05),
limit = c(0.9, 0.9, 0.9, 0.9, 0.9), sstd_mi = c(FALSE, FALSE, FALSE, FALSE,
FALSE), sstd_hbp = c(FALSE, FALSE, FALSE, FALSE, FALSE),
sstd_ptca = c(FALSE, FALSE, FALSE, FALSE, FALSE), sstd_cabg = c(TRUE, TRUE,
TRUE, TRUE, TRUE), sstd_norm_perf = c(FALSE, FALSE, FALSE, FALSE, FALSE),
sstd_posnegett = structure(c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), .Label = c("-", "+"), class = "factor"),
sstd_function = structure(c(NA_integer_, NA_integer_, NA_integer_,
NA_integer_, NA_integer_), .Label = c("MODERATE", "NORMAL", "POOR",
"VERY POOR"), class = "factor"), cvd_cur_fmt_n = c(3, 3, NA, 2, 2),
cvd_prev_fmt_n = c(NA, NA, NA, 1, NA), cvd_cur_fmt2 = structure(c(3L, 3L,
1L, 3L, 3L), .Label = c("All", "Best", "Quartiles"), class = "factor"),
cvd_prev_fmt2 = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("All", "Best",
"Quartiles"), class = "factor"), es_time = c(0.3091667, 0.3091667, 0.2625,
0.303, 0.2625), es_time_err = c(0.04416667, 0.04416667, 0.04375, 0.0433,
0.04375), ed_time = c(0.5741667, 0.5741667, 0.4375, 0.39, 0.4375)),
.Names = c("cvd_basestudy", "ecd_rhythm", "cvd_frame_mode", "cvd_part_fmt",
"cvd_prev_fmt", "cvd_cur_fmt", "ps_pt", "es_pt", "ed_pt",
"cvd_median_limit", "cvd_average_beat", "limit", "sstd_mi", "sstd_hbp",
"sstd_ptca", "sstd_cabg", "sstd_norm_perf", "sstd_posnegett",
"sstd_function", "cvd_cur_fmt_n", "cvd_prev_fmt_n", "cvd_cur_fmt2",
"cvd_prev_fmt2", "es_time", "es_time_err", "ed_time"),
row.names = c(651, 655, 656, 661, 663), class = "data.frame")

This message may contain confidential information. If yo...{{dropped:21}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
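One way to get the whole row at each group maximum, as a sketch on toy data using the column names from the post (the full `datares_sinus_variable` is not reproduced here):

```r
# Toy stand-in for datares_sinus_variable, using the posted column names
df <- data.frame(
  cvd_basestudy = c("study1", "study2", "study2", "study3", "study3"),
  es_time       = c(0.3091667, 0.3091667, 0.2625, 0.303, 0.2625)
)

# Base R: ave() repeats each group's maximum alongside every row, so
# comparing against it keeps the complete row(s) at the group maximum
maxrows <- df[df$es_time == ave(df$es_time, df$cvd_basestudy, FUN = max), ]

# The plyr equivalent, returning one whole row per group:
# library(plyr)
# ddply(df, .(cvd_basestudy), function(x) x[which.max(x$es_time), ])
```

Note that the `==` comparison keeps ties (every row equal to the group maximum), while `which.max()` keeps only the first such row.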
[R] Finding the dominant factor in an unbalanced group
Hi all

This is perhaps more a statistics question, but I'm hoping someone can help me. I have a group of patients for whom I'm looking at beat-to-beat RR interval changes. I have plotted the difference between one beat length and the next against the difference between the previous beat length and the current one. This gives me a plot with four quadrants: the bottom left corresponding to successively shorter beats, the top right to successively longer beats, the top left to a shorter beat followed by a longer one, and the bottom right to a longer beat followed by a shorter one.

In theory, if successive changes in beat length are random, there should be an approximately equal number of counts in each quadrant of my plot. I have a data frame which, for each of my patients, lists the number of counts in each quadrant (dput data at the end of this mail). I can determine whether the distribution is balanced or not with a chi-squared test (chisq.test). However, what I would like to do is determine whether there is a dominant quadrant (e.g. CBP06118 in the example data), or a dominant pair of quadrants (e.g. CBP06036 in the example data), and if so which they are.

If my data set were only 10 patients it probably wouldn't be a problem (although I'm not certain what statistical check I could do beyond re-applying chi-squared tests with only the relevant quadrants, which sounds dodgy to me); the problem occurs because my data set is a couple of orders of magnitude bigger. Can anyone help?
dput data is:

structure(list(basestudy = structure(1:10, .Label = c("CBP06036",
"CBP06095", "CBP06098", "CBP06100", "CBP06112", "CBP06118", "CBP06127",
"CBP06158", "CBP06163", "CBP06166"), class = "factor"), tl = c(302L, 211L,
347L, 223L, 178L, 230L, 243L, 278L, 391L, 252L), tr = c(99L, 134L, 171L,
210L, 158L, 252L, 89L, 247L, 258L, 168L), br = c(305L, 212L, 346L, 223L,
178L, 231L, 244L, 277L, 388L, 254L), bl = c(142L, 288L, 284L, 191L, 144L,
360L, 147L, 184L, 164L, 186L)), .Names = c("basestudy", "tl", "tr", "br",
"bl"), row.names = c(NA, 10L), class = "data.frame")

Many thanks

--
Sandy Small
Clinical Physicist
NHS Greater Glasgow and Clyde and NHS Forth Valley
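One hedged way to flag a dominant quadrant (a sketch only, not a full multiple-testing treatment): run chisq.test() per patient against a uniform expectation and inspect the Pearson residuals; large positive residuals mark over-represented quadrants.

```r
# Two patients from the posted data
counts <- data.frame(
  basestudy = c("CBP06036", "CBP06118"),
  tl = c(302L, 230L), tr = c(99L, 252L),
  br = c(305L, 231L), bl = c(142L, 360L)
)

# For each patient: test uniformity, then report the Pearson residuals
# (observed - expected) / sqrt(expected); residuals well above ~2
# indicate an over-represented ("dominant") quadrant
dominant <- t(apply(counts[, c("tl", "tr", "br", "bl")], 1, function(x) {
  ct <- chisq.test(x)                 # H0: equal counts in all 4 quadrants
  c(p.value = ct$p.value, ct$residuals)
}))
rownames(dominant) <- as.character(counts$basestudy)
```

With hundreds of patients, the per-patient p-values should be corrected for multiple testing, e.g. `p.adjust(dominant[, "p.value"], method = "BH")`.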
Re: [R] ggplot2, geom_hline and facet_grid
Thank you. That seems to work - also on my much larger data set. I'm not sure I understand why it has to be defined as a factor, but if it works...

Sandy

Dennis Murphy wrote:

Hi Sandy:

I can reproduce your problem given the data provided. When I change ecd_rhythm from character to factor, it works as you intended.

str(lvefeg)
List of 4    ### Interesting...
 $ cvd_basestudy: chr [1:10] "CBP05J02" "CBP05J02" "CBP05J02" "CBP05J02" ...
 $ ecd_rhythm   : chr [1:10] "AF" "AF" "AF" "AF" ...
 $ fixed_time   : num [1:10] 30.9 33.2 32.6 32.1 30.9 ...
 $ variable_time: num [1:10] 29.4 32 30.3 33.7 28.3 ...
 - attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 9 10 11

class(lvefeg)
[1] "cast_df"    "data.frame"

lvefeg$ecd_rhythm <- factor(lvefeg$ecd_rhythm)
p <- qplot((variable_time + fixed_time)/2, variable_time - fixed_time,
           data = lvefeg, geom = 'point')
p
p + facet_grid(ecd_rhythm ~ .) + geom_hline(yintercept = 0)

Does that work on your end? (And thank you for the reproducible example. Using dput() allows us to see what you see, which is very helpful.)

HTH,
Dennis

On Wed, Jan 19, 2011 at 1:30 PM, Small Sandy (NHS Greater Glasgow Clyde) sandy.sm...@nhs.net wrote:

Hi

Still having problems, in that when I use geom_hline and facet_grid together I get two extra empty panels. A reproducible example can be found at: https://gist.github.com/786894

Sandy Small

From: h.wick...@gmail.com [h.wick...@gmail.com] On Behalf Of Hadley Wickham [had...@rice.edu]
Sent: 19 January 2011 15:11
To: Small Sandy (NHS Greater Glasgow Clyde)
Cc: r-help@r-project.org
Subject: Re: [R] ggplot2, geom_hline and facet_grid

Hi Sandy,

It's difficult to know what's going wrong without a small reproducible example (https://github.com/hadley/devtools/wiki/Reproducibility) - could you please provide one? You might also have better luck with an email directly to the ggplot2 mailing list.

Hadley

On Wed, Jan 19, 2011 at 2:57 AM, Sandy Small sandy.sm...@nhs.net wrote:

Having upgraded to R version 2.12.1 I still have the same problem: the combination of facet_grid and geom_hline produces (for me) 4 panels, of which two are empty of any data or lines (labelled 1 and 2). Removing either the facet_grid or the geom_hline gives me the result I would then expect. I have tried forcing the rhythm to be a factor. Anyone have any ideas?

Sandy

Dennis Murphy wrote:

Hi:

The attached plot comes from the following code:

g <- ggplot(data = f, aes(x = (variable_time + fixed_time)/2,
                          y = variable_time - fixed_time))
g + geom_point() + geom_hline(yintercept = 0) + facet_grid(ecd_rhythm ~ .)

Is this what you were expecting?

sessionInfo()
R version 2.12.1 Patched (2010-12-18 r53869)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  grid
[8] methods   base

other attached packages:
 [1] data.table_1.5.1 doBy_4.2.2       R2HTML_2.2       contrast_0.13
 [5] Design_2.3-0     Hmisc_3.8-3      survival_2.36-2  sos_1.3-0
 [9] brew_1.0-4       lattice_0.19-17  ggplot2_0.8.9    proto_0.3-8
[13] reshape_0.8.3    plyr_1.4

loaded via a namespace (and not attached):
[1] cluster_1.13.2     digest_0.4.2       Matrix_0.999375-46 reshape2_1.1
[5] stringr_0.4        tools_2.12.1

HTH,
Dennis

On Tue, Jan 18, 2011 at 1:46 AM, Small Sandy (NHS Greater Glasgow Clyde) sandy.sm...@nhs.net wrote:

Hi

I have a long data set on which I want to do Bland-Altman style plots for each rhythm type. Using ggplot2, when I use geom_hline with facet_grid I get an extra set of empty panels. I can't get it to do it with the Diamonds data supplied with the package, so here is a (much abbreviated) example:

lvexs
   cvd_basestudy ecd_rhythm fixed_time variable_time
1       CBP05J02         AF    30.9000       29.4225
2       CBP05J02         AF    33.1700       32.0350
3       CBP05J02         AF    32.5700       30.2775
4       CBP05J02         AF    32.0550       33.7275
5       CBP05J02      SINUS    30.9175       28.3475
6       CBP05J02      SINUS    30.5725       29.7450
7       CBP05J02      SINUS    33.
Re: [R] ggplot2, geom_hline and facet_grid
Having upgraded to R version 2.12.1 I still have the same problem: the combination of facet_grid and geom_hline produces (for me) 4 panels, of which two are empty of any data or lines (labelled 1 and 2). Removing either the facet_grid or the geom_hline gives me the result I would then expect. I have tried forcing the rhythm to be a factor. Anyone have any ideas?

Sandy

Dennis Murphy wrote:

Hi:

The attached plot comes from the following code:

g <- ggplot(data = f, aes(x = (variable_time + fixed_time)/2,
                          y = variable_time - fixed_time))
g + geom_point() + geom_hline(yintercept = 0) + facet_grid(ecd_rhythm ~ .)

Is this what you were expecting?

sessionInfo()
R version 2.12.1 Patched (2010-12-18 r53869)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  grid
[8] methods   base

other attached packages:
 [1] data.table_1.5.1 doBy_4.2.2       R2HTML_2.2       contrast_0.13
 [5] Design_2.3-0     Hmisc_3.8-3      survival_2.36-2  sos_1.3-0
 [9] brew_1.0-4       lattice_0.19-17  ggplot2_0.8.9    proto_0.3-8
[13] reshape_0.8.3    plyr_1.4

loaded via a namespace (and not attached):
[1] cluster_1.13.2     digest_0.4.2       Matrix_0.999375-46 reshape2_1.1
[5] stringr_0.4        tools_2.12.1

HTH,
Dennis

On Tue, Jan 18, 2011 at 1:46 AM, Small Sandy (NHS Greater Glasgow Clyde) sandy.sm...@nhs.net wrote:

Hi

I have a long data set on which I want to do Bland-Altman style plots for each rhythm type. Using ggplot2, when I use geom_hline with facet_grid I get an extra set of empty panels. I can't get it to do it with the Diamonds data supplied with the package, so here is a (much abbreviated) example:

lvexs
   cvd_basestudy ecd_rhythm fixed_time variable_time
1       CBP05J02         AF    30.9000       29.4225
2       CBP05J02         AF    33.1700       32.0350
3       CBP05J02         AF    32.5700       30.2775
4       CBP05J02         AF    32.0550       33.7275
5       CBP05J02      SINUS    30.9175       28.3475
6       CBP05J02      SINUS    30.5725       29.7450
7       CBP05J02      SINUS    33.           31.1550
9       CBP05J02      SINUS    31.8350       30.7000
10      CBP05J02      SINUS    34.0450       33.4800
11      CBP05J02      SINUS    31.3975       29.8150

qplot((variable_time + fixed_time)/2, variable_time - fixed_time,
      data = lvexs) + facet_grid(ecd_rhythm ~ .) + geom_hline(yintercept = 0)

If I take out the geom_hline I get the plots I would expect. It doesn't seem to make any difference if I compute the mean and difference separately. Can anyone explain this and tell me how to avoid it (and why does it work with the Diamonds data set)?

Any help much appreciated - thanks.

Sandy

Sandy Small
Clinical Physicist
NHS Forth Valley and NHS Greater Glasgow and Clyde
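For reference, a minimal sketch of the fix that resolved this thread, on an abbreviated stand-in for the posted lvexs data (assumes ggplot2 is installed):

```r
library(ggplot2)

# Abbreviated stand-in for the posted lvexs data
lvexs <- data.frame(
  cvd_basestudy = "CBP05J02",
  ecd_rhythm    = c("AF", "AF", "SINUS", "SINUS"),  # character, not factor
  fixed_time    = c(30.9000, 33.1700, 30.9175, 30.5725),
  variable_time = c(29.4225, 32.0350, 28.3475, 29.7450)
)

# The fix: convert the facetting variable to a factor before plotting
# (in ggplot2 0.8.x a character facetting column combined with geom_hline
# produced the spurious empty panels described in the thread)
lvexs$ecd_rhythm <- factor(lvexs$ecd_rhythm)

p <- ggplot(lvexs, aes((variable_time + fixed_time)/2,
                       variable_time - fixed_time)) +
  geom_point() +
  facet_grid(ecd_rhythm ~ .) +
  geom_hline(yintercept = 0)
```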
[R] Comparing samples with widely different uncertainties
Hi

This is probably more of a statistics question than a specific R question, although I will be using R and need to know how to solve the problem in R.

I have several sets of data (ejection fraction measurements) taken in various ways from the same set of (~400) patients (so it is paired data). For each individual measurement I can make an estimate of the percentage uncertainty in the measurement. Generally the measurements in data set A are higher but have a large uncertainty (~20%), while the measurements in data set B are lower but have a small uncertainty (~4%). I believe, from the physiology, that the true value is likely to be nearer the value of A than of B.

I need to show that, despite the uncertainties in the measurements (which are not themselves normally distributed), there is (or is not) a difference between the two groups. (A straight Wilcoxon signed rank test shows a difference, but it cannot take that uncertainty data into account.) Can anybody suggest what I should be looking at? Is there terminology here that I don't know? How do I do it in R?

Many thanks for your help

Sandy

--
Sandy Small
Clinical Physicist
NHS Greater Glasgow and Clyde and NHS Forth Valley
Phone: 01412114592
E-mail: sandy.sm...@nhs.net
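One possibility, as a sketch only on simulated stand-in data (none of these numbers come from the actual measurements): propagate each measurement's stated uncertainty by Monte Carlo and check whether the paired test result is robust to it.

```r
set.seed(1)

# Simulated stand-in for the paired data: A high/noisy, B low/precise
n <- 400
A <- rnorm(n, mean = 55, sd = 8);  A_err <- 0.20 * A   # ~20% uncertainty
B <- rnorm(n, mean = 50, sd = 8);  B_err <- 0.04 * B   # ~4% uncertainty

# Perturb each measurement within its stated uncertainty and re-run the
# paired Wilcoxon signed rank test; if the p-values stay small across
# perturbations, the difference is robust to the measurement uncertainty
pvals <- replicate(200, {
  wilcox.test(A + rnorm(n, 0, abs(A_err)),
              B + rnorm(n, 0, abs(B_err)),
              paired = TRUE)$p.value
})
summary(pvals)
```

This treats each stated uncertainty as the standard deviation of a Gaussian perturbation, which is itself an assumption; a bootstrap over patients could be layered on top in the same way.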
Re: [R] Plotting ordered nominal data
Many thanks. I was sure it was simple. That was exactly what I wanted. I should have clarified that I was looking for a box plot. Thanks to all who responded.

Sandy

S Ellison wrote:

Sandy,

You can re-order a factor with

df$Eyeball <- factor(df$Eyeball, levels = c("Normal", "Mild", "Moderate", "Severe"), ordered = TRUE)

(assuming df is your data frame and that you want an _ordered_ factor; the latter is not essential to your plots).

Incidentally, NULL isn't a particularly friendly item to find in a data frame. NULL often implies "I'm not here at all", while NA says "I exist, but I'm a missing value". For an example of when it might matter, try

length(c(1, 2, NULL, 3))  # versus
length(c(1, 2, NA, 3))

Steve E

Sandy Small [EMAIL PROTECTED] 01/08/2008 16:21:29 wrote:

Hi

I'm sure this question has been asked before but I can't find it in the archives. I want to plot them in the order Normal, Mild, Moderate, Severe so that the trend (or not) is obvious.

*** This email and any attachments are confidential. Any use...{{dropped:29}}
[R] Plotting ordered nominal data
Hi

I'm sure this question has been asked before but I can't find it in the archives. I have a data frame which includes interval and ordered nominal results. It looks something like:

Measured  Eyeball
46.5      Normal
43.5      Mild
56.2      Normal
41.1      Mild
37.8      Moderate
12.6      Severe
17.3      Moderate
39.1      Normal
26.7      Mild
NULL      Normal
27.9      NULL
68.1      Normal

I want to plot the Measured value against the Eyeball value, but if I simply plot it the Eyeball values are plotted in alphabetical order. I do not want to change the names, as Normal, Mild, Moderate, Severe are standard, but I want to plot them in the order Normal, Mild, Moderate, Severe so that the trend (or not) is obvious.

Any help would be much appreciated.

Many thanks

Sandy

***
This message may contain confidential and privileged information. If you are not the intended recipient you should not disclose, copy or distribute information in this e-mail or take any action in reliance on its contents. To do so is strictly prohibited and may be unlawful. Please inform the sender that this message has gone astray before deleting it. Thank you.

2008 marks the 60th anniversary of the NHS. It's an opportunity to pay tribute to the NHS staff and volunteers who help shape the service, and celebrate their achievements.

If you work for the NHS and would like an NHSmail email account, go to: www.connectingforhealth.nhs.uk/nhsmail
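The accepted fix from this thread, as a runnable sketch built from the example above (the NULL entries are taken as NA):

```r
# The posted example, with NULL entered as NA
df <- data.frame(
  Measured = c(46.5, 43.5, 56.2, 41.1, 37.8, 12.6,
               17.3, 39.1, 26.7, NA, 27.9, 68.1),
  Eyeball  = c("Normal", "Mild", "Normal", "Mild", "Moderate", "Severe",
               "Moderate", "Normal", "Mild", "Normal", NA, "Normal")
)

# Re-order the factor by severity so plots stop using alphabetical order
df$Eyeball <- factor(df$Eyeball,
                     levels = c("Normal", "Mild", "Moderate", "Severe"),
                     ordered = TRUE)

# boxplot() (and plot() on a factor) now follows the level order
boxplot(Measured ~ Eyeball, data = df)
```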
[R] Normalizing grouped data in a data frame
Hi

I am a newbie to R but have tried a number of ways to do this in R and can't find a good solution. (I could do it outside R in perl or awk, but would like to know how to do this in R.)

I have a large data frame (49 variables and 7000 observations), but for simplicity I can express it as the following data frame:

Base, Image, LVEF, ES_Time
A, 1, 4.32, 0.89
A, 2, 4.98, 0.67
A, 3, 3.7, 0.5
A, 3, 4.1, 0.8
B, 1, 7.4, 0.7
B, 3, 7.2, 0.8
B, 4, 7.8, 0.6
C, 1, 5.6, 1.1
C, 4, 5.2, 1.3
C, 5, 5.9, 1.2
C, 6, 6.1, 1.2
C, 7, 3.2, 1.1

For each value of LVEF and ES_Time I would like to normalise the value to the maximum for that factor, grouped by Base or Image number, adding an extra column to the data frame with the normalised value in it. So for the Base = B group I would get a modified data frame as follows (the data frame should have the same length; I'm just showing the B part):

Base, Image, LVEF, ES_Time, Norm_LVEF, Norm_ES_Time
...
B, 1, 7.4, 0.7, 7.4/7.8, 0.7/0.8
B, 3, 7.2, 0.8, 7.2/7.8, 0.8/0.8
B, 4, 7.8, 0.6, 7.8/7.8, 0.6/0.8
...

where the results of the division would replace the division shown here. I hope this makes sense. If anyone can help I would be very grateful.

Sandy Small
NHS Glasgow, UK

**
This message may contain confidential and privileged information. If you are not the intended recipient please accept our apologies. Please do not disclose, copy or distribute information in this e-mail or take any action in reliance on its contents: to do so is strictly prohibited and may be unlawful. Please inform us that this message has gone astray before deleting it. Thank you for your co-operation.

NHSmail is used daily by over 100,000 staff in the NHS. Over a million messages are sent every day by the system. To find out why more and more NHS personnel are switching to this NHS Connecting for Health system please visit www.connectingforhealth.nhs.uk/nhsmail
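A base-R sketch of the requested normalisation, using ave() on a toy copy of the example above (grouping by Base; the same pattern works for Image):

```r
# Toy copy of the posted data
df <- data.frame(
  Base    = c("A","A","A","A","B","B","B","C","C","C","C","C"),
  Image   = c(1, 2, 3, 3, 1, 3, 4, 1, 4, 5, 6, 7),
  LVEF    = c(4.32, 4.98, 3.7, 4.1, 7.4, 7.2, 7.8, 5.6, 5.2, 5.9, 6.1, 3.2),
  ES_Time = c(0.89, 0.67, 0.5, 0.8, 0.7, 0.8, 0.6, 1.1, 1.3, 1.2, 1.2, 1.1)
)

# ave() repeats each Base group's maximum alongside every row, so the
# division normalises within the group while keeping the frame's length
df$Norm_LVEF    <- df$LVEF    / ave(df$LVEF,    df$Base, FUN = max)
df$Norm_ES_Time <- df$ES_Time / ave(df$ES_Time, df$Base, FUN = max)
```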
Re: [R] Normalizing grouped data in a data frame
Thank you very much. That works nicely. The trick I particularly needed was within(), which I didn't know about. Also nice to get a data frame out with sparseby() instead of just a multi-way array with by().

Sandy

Duncan Murdoch wrote:

Sandy Small wrote:

Hi

I am a newbie to R but have tried a number of ways to do this in R and can't find a good solution. (I could do it outside R in perl or awk, but would like to know how to do this in R.)

I have a large data frame (49 variables and 7000 observations), but for simplicity I can express it as the following data frame:

Base, Image, LVEF, ES_Time
A, 1, 4.32, 0.89
A, 2, 4.98, 0.67
A, 3, 3.7, 0.5
A, 3, 4.1, 0.8
B, 1, 7.4, 0.7
B, 3, 7.2, 0.8
B, 4, 7.8, 0.6
C, 1, 5.6, 1.1
C, 4, 5.2, 1.3
C, 5, 5.9, 1.2
C, 6, 6.1, 1.2
C, 7, 3.2, 1.1

For each value of LVEF and ES_Time I would like to normalise the value to the maximum for that factor, grouped by Base or Image number, adding an extra column to the data frame with the normalised value in it. So for the Base = B group I would get a modified data frame as follows (the data frame should have the same length; I'm just showing the B part):

Base, Image, LVEF, ES_Time, Norm_LVEF, Norm_ES_Time
...
B, 1, 7.4, 0.7, 7.4/7.8, 0.7/0.8
B, 3, 7.2, 0.8, 7.2/7.8, 0.8/0.8
B, 4, 7.8, 0.6, 7.8/7.8, 0.6/0.8
...

where the results of the division would replace the division shown here. I hope this makes sense. If anyone can help I would be very grateful.

You want to look at the by(), tapply() or sparseby() functions (the latter is in the reshape package; the others are in base R). For example, I think this untested code does what you want:

newdf <- sparseby(olddf, c("Base", "Image"), function(subset)
    within(subset, {
        Norm_LVEF <- LVEF/max(LVEF)
        Norm_ES_Time <- ES_Time/max(ES_Time)
    }))

where olddf is the old data frame and newdf is newly created.

Duncan Murdoch