> On Mar 17, 2017, at 11:33 AM, Alicia Ellis <alicia.m.el...@gmail.com> wrote: > > am cleaning some very messy health record lab data. Several of the rows > in the VALUE column have text entries and they need to be converted to > numeric in the NUMERIC_VALUE column based on the values in VALUE and > DESCRIPTION. For example: > > df <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than 0.30", > "12%", "<0.2", "Unknown"), > DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"), > NUMERIC_VALUE=c(9, 9,9,9,9,9,9)) > df > > df$NUMERIC_VALUE[df$VALUE == "Positive" & df$DESCRIPTION == "A"]=999999999 > > > However, I need to do this for ~500 unique pairings of VALUE and > DESCRIPTION entries. I'm trying to find an easy way to do this without > having to have 500 lines of code for each unique pairing. Some of the > pairings will be changed to the same value (e.g., 99999999, or -999999999) > but many will be unique numeric values.
I'm not convinced that you have the necessary scientific background to do this job properly. There are different sorts of lab tests: enzymatic activity, solute concentrations, and viral titers come to mind immediately for which the handling would be materially different. If you have a serum sodium level of less than 100 mEq/ml, then that is a value inconsistent with human life and the value should be set to NA. If you have a value of alkaline phosphatase that is 0 it is most suggestive of specimen mishandling. If you have a hepatitis B antigen level of "Positive", then it seems a perfectly informative value that should not be changed. Furthermore the attempt to change vlaues that you think are missing to 99999999, or -999999999 is simply wrongheaded. Learn to use the missing value indicator NA rather thna setting these to a numeric value. You should seek advice within your organization before you charge ahead with this strategy. -- David. > > > I've started by creating a new object called rules where a SUBSET of df > rows are included with the new value they should be changed to. > > > rules <- data.frame(VALUE = c("<60", "Negative", "Less than 0.30", "<0.2", > "Unknown"), > DESCRIPTION = c("A", "B", "C", "E", "E"), > NEW_VALUE=c(60, -999999,0.29,0.1,777777)) > rules > > > I tried doing a loop to change the values in df based on the suggested > value in rules: > > for (i in (1 : nrow(rules))) { > df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules > [i,2]]=rules[i,3] > } > df > > This gives the error: > > Error in Ops.factor(df$VALUE, rules[i, 1]) : > level sets of factors are differentwork and I think because when I write > > If I create rules using the exact same levels as df it works: > > rules <- data.frame(VALUE = c("<60", "Positive", "Negative", "Less than > 0.30", "12%", "<0.2", "Unknown"), > DESCRIPTION = c("A","A", "B", "C", "D", "E", "E"), > NEW_VALUE=c(60, 999999,-999999,0.29,12,0.1,777777)) > rules > > for (i in (1 : nrow(rules))) { > df$NUMERIC_VALUE[df$VALUE == rules[i,1] & df$DESCRIPTION == rules > [i,2]]=rules[i,3] > } > df > > > Can anyone suggest a way to modify my for loop so that it works for a > subset of rows in df and accomplish what I want? Or suggest a completely > different method that works? > > > Thanks! > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. David Winsemius Alameda, CA, USA ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.