It is possible for errors to occur on the same row in different columns. This does not happen in your example. If row 4 were "John6, 3BC, 175X", then row 4 would be included 3 times in the index vector, but we only need to remove it once. Removing the duplicates is not necessary since R would not get confused, but length(unique(c(BadName, BadAge, BadWeight))) indicates how many lines are being removed.
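A minimal sketch of the point above. It uses the hypothetical row 4 ("John6, 3BC, 175X") rather than the data actually posted, and the names dat, bad, with_unique and without_unique are made up for illustration only:

```r
# Hypothetical data: row 4 is bad in all three columns (not the posted data)
dat <- read.table(text = "Name, Age, Weight
Alex, 20, 13X
Bob, 25, 142
Carol, 24, 120
John6, 3BC, 175X
Katy, 35, 160
Jack3, 34, 140", sep = ",", header = TRUE, stringsAsFactors = FALSE,
                  strip.white = TRUE)

# Same checks as in the thread: digits in Name, letters in Age and Weight
BadName   <- which(grepl("[[:digit:]]", dat$Name))
BadAge    <- which(grepl("[[:alpha:]]", dat$Age))
BadWeight <- which(grepl("[[:alpha:]]", dat$Weight))

# Row 4 appears three times in the combined index vector
bad <- c(BadName, BadAge, BadWeight)

# Negative indexing drops each listed row once, duplicates or not
with_unique    <- dat[-unique(bad), ]
without_unique <- dat[-bad, ]
identical(with_unique, without_unique)

# Only the de-duplicated length counts the rows actually removed
length(bad)
length(unique(bad))
```

Both subsets should contain the same three rows (Bob, Carol, Katy), so unique() changes only the count, not the result. On a large data set, unique() on a vector of row numbers is likely cheap compared with the grepl() calls, but that is an expectation, not a measurement.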
David

On Sat, Jan 29, 2022 at 8:32 PM Val <valkr...@gmail.com> wrote:

> Thank you David for your help.
>
> I just have one question on this. What is the purpose of using the
> "unique" function on this?
> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>
> I got the same result without using it.
> (dat2 <- dat1[-(c(BadName, BadAge, BadWeight)), ])
>
> My concern is when I am applying this for the large data set the "unique"
> function may consume resources(time and memory).
>
> Thank you.
>
> On Sat, Jan 29, 2022 at 12:30 AM David Carlson <dcarl...@tamu.edu> wrote:
>
>> Given that you know which columns should be numeric and which should be
>> character, finding characters in numeric columns or numbers in character
>> columns is not difficult. Your data frame consists of three character
>> columns so you can use regular expressions as Bert mentioned. First you
>> should strip the whitespace out of your data:
>>
>> dat1 <-read.table(text="Name, Age, Weight
>> Alex, 20, 13X
>> Bob, 25, 142
>> Carol, 24, 120
>> John, 3BC, 175
>> Katy, 35, 160
>> Jack3, 34, 140",sep=",", header=TRUE, stringsAsFactors=FALSE,
>> strip.white=TRUE)
>>
>> Now check to see if all of the fields are character as expected.
>>
>> sapply(dat1, typeof)
>> #        Name         Age      Weight
>> # "character" "character" "character"
>>
>> Now identify character variables containing numbers and numeric variables
>> containing characters:
>>
>> BadName <- which(grepl("[[:digit:]]", dat1$Name))
>> BadAge <- which(grepl("[[:alpha:]]", dat1$Age))
>> BadWeight <- which(grepl("[[:alpha:]]", dat1$Weight))
>>
>> Next remove those rows:
>>
>> (dat2 <- dat1[-unique(c(BadName, BadAge, BadWeight)), ])
>> #    Name Age Weight
>> # 2   Bob  25    142
>> # 3 Carol  24    120
>> # 5  Katy  35    160
>>
>> You still need to convert Age and Weight to numeric, e.g. dat2$Age <-
>> as.numeric(dat2$Age).
>>
>> David Carlson
>>
>>
>> On Fri, Jan 28, 2022 at 11:59 PM Bert Gunter <bgunter.4...@gmail.com>
>> wrote:
>>
>>> As character 'polluted' entries will cause a column to be read in (via
>>> read.table and relatives) as factor or character data, this sounds like a
>>> job for regular expressions. If you are not familiar with this subject,
>>> time to learn. And, yes, some heavy lifting will be required.
>>> See ?regexp for a start maybe? Or the stringr package?
>>>
>>> Cheers,
>>> Bert
>>>
>>>
>>> On Fri, Jan 28, 2022, 7:08 PM Val <valkr...@gmail.com> wrote:
>>>
>>> > Hi All,
>>> >
>>> > I want to remove rows that contain a character string in an integer
>>> > column or a digit in a character column.
>>> >
>>> > Sample data
>>> >
>>> > dat1 <-read.table(text="Name, Age, Weight
>>> > Alex, 20, 13X
>>> > Bob, 25, 142
>>> > Carol, 24, 120
>>> > John, 3BC, 175
>>> > Katy, 35, 160
>>> > Jack3, 34, 140",sep=",",header=TRUE,stringsAsFactors=F)
>>> >
>>> > If the Age/Weight column contains any character(s) then remove
>>> > if the Name column contains an digit then remove that row
>>> > Desired output
>>> >
>>> >    Name Age weight
>>> > 1   Bob  25    142
>>> > 2 Carol  24    120
>>> > 3  Katy  35    160
>>> >
>>> > Thank you,
>>> >
>>> > ______________________________________________
>>> > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-help__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVXhZB_0c$
>>> > PLEASE do read the posting guide
>>> > https://urldefense.com/v3/__http://www.R-project.org/posting-guide.html__;!!KwNVnqRv!QW1WPKY5eSNT7sMW28dnAKV7IXWvIc4UwOwUHkJgJ8uuGUrIAXvRjZWVRmZSfcI$
>>> > and provide commented, minimal, self-contained, reproducible code.
>>> >
>>>
>>> [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.