I am reviving this thread, which has been quiet for some time, because I remembered the following question.

Christophe Pallier wrote:
> Hi,
>
> Can you provide examples of data formats that are problematic to read and
> clean with R ?

Today I had a data manipulation problem that I did not know how to solve in R, so I solved it with perl. Since I'm always interested in learning more about complex data manipulation in R, I am posting my problem in the hope of receiving some hints for doing this in R. If anyone has nothing better to do than play with other people's data, I would be happy to send the raw files off-list.

Background:

I have been given data that contains two measurements of left ventricular ejection fraction. One of the methods is echocardiogram, which sometimes gives a true quantitative value and other times a semi-quantitative value. The desire is to compare echo with the other method (MUGA). In most cases, patients had either a quantitative or a semi-quantitative value. Some patients had both. The data came to me in excel files with, basically, no patient identifiers to link the "both" patients with the semi-quantitative patients (the "both" patients were in multiple data sets).

What I wanted to do was extract from the semi-quantitative data file those patients with only semi-quantitative values. All I have to link with are the semi-quantitative echo and the MUGA values, and these pairs of values are not unique.

To make this more concrete, here are some portions of the raw data.

"Both"

"ID NUM","ECHO","MUGA","Semiquant","Quant"
"B",12,37,10,12
"D",13,13,10,13
"E",13,26,10,15
"F",13,31,10,13
"H",15,15,10,15
"I",15,21,10,15
"J",15,22,10,15
"K",17,22,10,17
"N",17.5,4,10,17.5
"P",18,25,10,18
"R",19,25,10,19

Semi-quantitative

"echo","muga","quant"
10,20,0 <-- keep
10,20,0 <-- keep
10,21,0 <-- remove
10,21,0 <-- keep
10,24,0 <-- keep
10,25,0 <-- remove
10,25,0 <-- remove
10,25,0 <-- keep

Here is the perl program I wrote for this.

#!/usr/bin/perl

# Count how often each (semi-quantitative echo, MUGA) pair occurs
# among the "both" patients.
open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
# Discard first row
$_ = <BOTH>;
while(<BOTH>) {
    chomp;
    ($id, $e, $m, $sq, $qu) = split(/,/);
    $both{$sq,$m}++;
}
close(BOTH);

# Write the semi-quantitative-only patients: each row that matches a
# remaining "both" pair is dropped and uses up one count for that pair.
open(OUT, "> qual_echo_only.csv") || die "Can't open qual_echo_only.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 2001;

open(QUAL, "qual_echo.csv") || die "Can't open qual_echo.csv";
# Discard first row
$_ = <QUAL>;
while(<QUAL>) {
    chomp;
    ($echo, $muga, $quant) = split(/,/);
    if ($both{$echo,$muga} > 0) {
        $both{$echo,$muga}--;
    }
    else {
        print OUT "$pid,$echo,$muga,$quant\n";
        $pid++;
    }
}
close(QUAL);
close(OUT);

# Write the "both" patients in long format: one row for the
# semi-quantitative value (quant = 0) and one for the quantitative
# value (quant = 1), sharing the same pid.
open(OUT, "> both_echo.csv") || die "Can't open both_echo.csv";
print OUT "pid,echo,muga,quant\n";
$pid = 3001;

open(BOTH, "quant_qual_echo.csv") || die "Can't open quant_qual_echo.csv";
# Discard first row
$_ = <BOTH>;
while(<BOTH>) {
    chomp;
    ($id, $e, $m, $sq, $qu) = split(/,/);
    print OUT "$pid,$sq,$m,0\n";
    print OUT "$pid,$qu,$m,1\n";
    $pid++;
}
close(BOTH);
close(OUT);

-- 
Kevin E. Thorpe
Biostatistician/Trialist, Knowledge Translation Program
Assistant Professor, Department of Public Health Sciences
Faculty of Medicine, University of Toronto
email: [EMAIL PROTECTED] Tel: 416.864.5776 Fax: 416.864.6057

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.