Re: [R] Tools For Preparing Data For Analysis

Christophe Pallier Fri, 22 Jun 2007 08:39:45 -0700

If I understand correctly (from your Perl script)

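1. you count the number of occurrences of each "(echo, muga)" pair in the
first file.

2. you remove from the second file the lines that correspond to these
occurrences: for a pair seen n times in the first file, the first n matching
lines of the second file are dropped.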

If this is indeed your aim, here's a solution in R:

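# cumcount(x)[i] is the number of times x[i] has occurred in x[1:i]
cumcount <- function(x) {
  y <- numeric(length(x))
  for (i in seq_along(x)) {   # seq_along() is also safe for an empty x
    y[i] <- sum(x[1:i] == x[i])
  }
  y
}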

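# pairs seen in the "both" file, with the number of times each occurs
both <- read.csv('both_echo.csv')
v <- table(paste(both$echo, both$muga, sep="_"))

# pairs in the semi-quantitative file, in file order
semi <- read.csv('qual_echo.csv')
s <- paste(semi$echo, semi$muga, sep="_")
cs <- cumcount(s)          # running count of each pair
count <- as.vector(v[s])   # occurrences in "both"; NA if the pair is absent
count[is.na(count)] <- 0

# keep a line only once all the "both" occurrences of its pair are used up
semi2 <- data.frame(semi, s, cs, count, keep = cs > count)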

> semi2
  echo muga quant     s cs count  keep
1   10   20     0 10_20  1     0  TRUE
2   10   20     0 10_20  2     0  TRUE
3   10   21     0 10_21  1     1 FALSE
4   10   21     0 10_21  2     1  TRUE
5   10   24     0 10_24  1     0  TRUE
6   10   25     0 10_25  1     2 FALSE
7   10   25     0 10_25  2     2 FALSE
8   10   25     0 10_25  3     2  TRUE
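
To finish the job, the rows flagged 'keep' can be extracted and numbered the
way the Perl script does. What follows is only a sketch on the eight sample
rows: the pair counts in 'v' are typed in by hand to match the output above
(they are an assumption, not read from the real "both" file), and the name
'semi_only' is made up for the illustration. The starting id 2001 is taken
from the Perl script.

```r
# Running count: y[i] is how many times x[i] has occurred in x[1:i].
cumcount <- function(x) {
  y <- numeric(length(x))
  for (i in seq_along(x)) {
    y[i] <- sum(x[1:i] == x[i])
  }
  y
}

# The eight semi-quantitative sample rows from the original post.
semi <- data.frame(
  echo  = rep(10, 8),
  muga  = c(20, 20, 21, 21, 24, 25, 25, 25),
  quant = rep(0, 8)
)

# ASSUMPTION: hand-typed pair counts standing in for the "both" file,
# i.e. one 10_21 pair and two 10_25 pairs.
v <- table(c("10_21", "10_25", "10_25"))

s     <- paste(semi$echo, semi$muga, sep = "_")
cs    <- cumcount(s)
count <- as.vector(v[s])        # NA for pairs that never occur in "both"
count[is.na(count)] <- 0
keep  <- cs > count

# Keep the unmatched rows and number them from 2001, as in the Perl script.
semi_only <- semi[keep, ]
semi_only <- data.frame(pid = 2000 + seq_len(nrow(semi_only)), semi_only)

# Uncomment to write the file the Perl script calls qual_echo_only.csv:
# write.csv(semi_only, "qual_echo_only.csv", row.names = FALSE, quote = FALSE)
print(semi_only)
```

The write.csv() call is left commented out so that the sketch does not touch
any files when it is pasted into a session.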


My code is not very readable...
Yet the 'trick' of using a helper function like 'cumcount' might be
instructive.
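
For what it is worth, the same running count can probably be had without the
explicit loop. The sketch below uses ave() from base R to apply seq_along()
within each group of equal values; 'cumcount_ave' is a name invented here, and
the check is only against the eight sample pairs, so treat it as a suggestion
rather than a tested replacement.

```r
# Loop-based helper, repeated here so the example runs on its own.
cumcount <- function(x) {
  y <- numeric(length(x))
  for (i in seq_along(x)) {
    y[i] <- sum(x[1:i] == x[i])
  }
  y
}

# Hypothetical alternative: ave() splits the positions 1..n by the value of x
# and replaces each group with 1, 2, 3, ... in order of appearance.
cumcount_ave <- function(x) {
  ave(seq_along(x), x, FUN = seq_along)
}

# The pair keys from the sample semi-quantitative data.
s <- c("10_20", "10_20", "10_21", "10_21", "10_24", "10_25", "10_25", "10_25")

# Both versions should agree on the sample.
stopifnot(all(cumcount(s) == cumcount_ave(s)))
print(cumcount_ave(s))
```

The loop compares each element with all the earlier ones, so its cost grows
with the square of the number of rows; that hardly matters for a few hundred
patients, but the ave() version should scale better on long files.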

Christophe Pallier


On 6/22/07, Kevin E. Thorpe <[EMAIL PROTECTED]> wrote:
>
> I am posting to this thread that has been quiet for some time because I
> remembered the following question.
>
> Christophe Pallier wrote:
> > Hi,
> >
> > Can you provide examples of data formats that are problematic to read
> > and clean with R ?
>
> Today I had a data manipulation problem that I don't know how to do in R
> so I solved it with perl.  Since I'm always interested in learning more
> about complex data manipulation in R I am posting my problem in the
> hopes of receiving some hints for doing this in R.
>
> If anyone has nothing better to do than play with other people's data,
> I would be happy to send the raw files off-list.
>
> Background:
>
> I have been given data that contains two measurements of left
> ventricular ejection fraction.  One of the methods is echocardiogram
> which sometimes gives a true quantitative value and other times a
> semi-quantitative value.  The desire is to compare echo with the
> other method (MUGA).  In most cases, patients had either quantitative
> or semi-quantitative.  Some patients had both.  The data came
> to me in excel files with, basically, no patient identifiers to link
> the "both" with the semi-quantitative patients (the "both" patients
> were in multiple data sets).
>
> What I wanted to do was extract from the semi-quantitative data file
> those patients with only semi-quantitative.  All I have to link with
> are the semi-quantitative echo and the MUGA and these pairs of values
> are not unique.
>
> To make this more concrete, here are some portions of the raw data.
>
> "Both"
>
> "ID NUM","ECHO","MUGA","Semiquant","Quant"
> "B",12,37,10,12
> "D",13,13,10,13
> "E",13,26,10,15
> "F",13,31,10,13
> "H",15,15,10,15
> "I",15,21,10,15
> "J",15,22,10,15
> "K",17,22,10,17
> "N",17.5,4,10,17.5
> "P",18,25,10,18
> "R",19,25,10,19
>
> Semi-quantitative
>
> "echo","muga","quant"
> 10,20,0      <-- keep
> 10,20,0      <-- keep
> 10,21,0      <-- remove
> 10,21,0      <-- keep
> 10,24,0      <-- keep
> 10,25,0      <-- remove
> 10,25,0      <-- remove
> 10,25,0      <-- keep
>
> Here is the perl program I wrote for this.
>
> #!/usr/bin/perl
>
> open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
> # Discard first row;
> $_ = <BOTH>;
> while(<BOTH>) {
>     chomp;
>     ($id, $e, $m, $sq, $qu) = split(/,/);
>     $both{$sq,$m}++;
> }
> close(BOTH);
>
> open(OUT, "> qual_echo_only.csv") || die "Can't open qual_echo_only.csv";
> print OUT "pid,echo,muga,quant\n";
> $pid = 2001;
>
> open(QUAL, "qual_echo.csv") || die "Can't open qual_echo.csv";
> # Discard first row
> $_ = <QUAL>;
> while(<QUAL>) {
>     chomp;
>     ($echo, $muga, $quant) = split(/,/);
>     if ($both{$echo,$muga} > 0) {
>         $both{$echo,$muga}--;
>     }
>     else {
>         print OUT "$pid,$echo,$muga,$quant\n";
>         $pid++;
>     }
> }
> close(QUAL);
> close(OUT);
>
> open(OUT, "> both_echo.csv") || die "Can't open both_echo.csv";
> print OUT "pid,echo,muga,quant\n";
> $pid = 3001;
>
> open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
> # Discard first row;
> $_ = <BOTH>;
> while(<BOTH>) {
>     chomp;
>     ($id, $e, $m, $sq, $qu) = split(/,/);
>     print OUT "$pid,$sq,$m,0\n";
>     print OUT "$pid,$qu,$m,1\n";
>     $pid++;
> }
> close(BOTH);
> close(OUT);
>
>
> --
> Kevin E. Thorpe
> Biostatistician/Trialist, Knowledge Translation Program
> Assistant Professor, Department of Public Health Sciences
> Faculty of Medicine, University of Toronto
> email: [EMAIL PROTECTED]  Tel: 416.864.5776  Fax: 416.864.6057
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Christophe Pallier (http://www.pallier.org)

